Home Credit Dataset¶

💡 📢 ☑️ remember to read the readme.md file for helpful hints on the best ways to view/navigate this project

If you view this notebook on GitHub, you will be missing important content.

Some charts/diagrams/features are not rendered by GitHub. This is standard, well-known behaviour.

Consider viewing the pre-rendered HTML files, or running all notebooks end to end after enabling the feature flags that control long-running operations.

If you choose to run this locally, there are some prerequisites:

  • you will need Python 3.9
  • you will need to install the dependencies using pip install -r requirements.txt before proceeding.

Problem Statement¶

(provided by Turing College)

You and your friend came up with a brilliant startup idea - provide risk evaluation as a service for retail banks. As with most successful startup teams, both of you have your specialty. Your friend is responsible for sales and operations, while you are responsible for everything product-related, from planning to data analysis to building the solution. You have quickly identified that machine learning will be an essential part of your offering because you believe that the models can capture statistical patterns in the defaults on bank loans.

You decide to start your investigation by downloading this dataset from Home Credit Group. You are not yet sure, what is the most crucial problem for your potential clients, so you had a meeting with your friend to discuss how your proof-of-concept (POC) product should look like. After a lot of arguing, you both agreed to create a number of different models so that you have a robust and diversified offering when you get your first meeting with the potential clients. You are eager to investigate the dataset and see what you can predict, so you propose that you come up with interesting features to analyze and predict - this way, you'll focus on building a solid offering, and she can work on getting meetings with the banks.

Objectives for this Part¶

  • Practice translating business requirements into data science tasks.
  • Practice performing EDA.
  • Practice applying statistical inference procedures.
  • Practice using machine learning to solve business problems.
  • Practice deploying multiple machine learning models.

Requirements¶

  • Download the data from here and the data description from here.
  • Create a plan for your investigation, analysis, and POC building. This should include your assumptions, overall objectives, and objectives for each step in your plan. You are not expected to have a plan for the whole project but instead have a clear understanding of what you'll try to achieve in the next step and build the plan one step at a time.
  • Perform exploratory data analysis. This should include creating statistical summaries and charts, testing for anomalies, checking for correlations and other relations between variables, and other EDA elements.
  • Perform statistical inference. This should include defining the target population, forming multiple statistical hypotheses and constructing confidence intervals, setting the significance levels, conducting z or t-tests for these hypotheses.
  • Use machine learning models to predict the target variables based on your proposed plan. You should use hyperparameter tuning, model ensembling, the analysis of model selection, and other methods. The decision of where to use and not to use these techniques is up to you; however, they should be aligned with your team's objectives.
  • Deploy these machine learning models to Google Cloud Platform. You are free to choose any deployment option you wish as long as it can be called an HTTP request [sic]. (I understand they mean "as long as it can be called THROUGH an HTTP request")
  • Provide clear explanations in your notebook. Your explanations should inform the reader what you are trying to achieve, what results you got, and what these results mean.
  • Provide suggestions about how your analysis and models can be improved.

Imports and initial setup¶

In [1]:
import ipywidgets as widgets
from IPython.display import display, Markdown, Image, clear_output, HTML, IFrame
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
from mpl_toolkits.mplot3d import Axes3D
import itertools

import numpy as np
import pandas as pd
import polars as pl
import sqlite3
import seaborn as sns
from scipy import stats
from scipy.stats import chi2_contingency
import missingno as msno
import klib

import statsmodels.api as sm

from random import random, seed
import sqlite3 as lite
import logging
import warnings
import ydata_profiling
import iplantuml
import xml.dom.minidom

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    recall_score,
    precision_score,
)

from autofeatselect import CorrelationCalculator, FeatureSelector, AutoFeatureSelect

from sklearn.preprocessing import MinMaxScaler, StandardScaler, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import plot_tree
from sklearn.metrics import make_scorer, confusion_matrix, PrecisionRecallDisplay
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.base import clone as clone_pipeline
from sklearn.neighbors import NearestNeighbors

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import shap

from imblearn.under_sampling import RandomUnderSampler
import joblib

import os
from os import path

from utils import *
from utils import __

from autofeat import AutoFeatRegressor as AFR
from sklearn.model_selection import train_test_split as sklearn_train_test_split
from sklearn.exceptions import DataConversionWarning
import sweetviz as sv

import featuretools as ft
import woodwork as ww
import json
import subprocess

import category_encoders as ce

seed(100)
pd.options.display.max_rows = 100
pd.options.display.max_colwidth = 50
util.check("done")
loading utils modules... ✅ completed
configuring autoreload... ✅ completed
✅
In [3]:
%reload_ext credit_clean
%reload_ext credit_utils

import credit_clean
import credit_utils

Let's use black to auto-format all our cells so they adhere to PEP 8

In [4]:
import lab_black

%reload_ext lab_black
util.patch_nb_black()
# fmt: off
# fmt: on
In [5]:
from sklearn import set_config

set_config(transform_output="pandas")
In [6]:
sns.set_theme(context="notebook", style="whitegrid")

moonstone = "#62b6cb"
moonstone_rgb = handy_utils.hex_to_rgb(moonstone)
moonstone_rgb_n = np.array(moonstone_rgb) / 255
In [7]:
logger = util.configure_logging(jupyterlab_level=logging.WARN, file_level=logging.DEBUG)

warnings.filterwarnings("ignore", category=FutureWarning)

# import warnings
# warnings.filterwarnings('error', category=pd.errors.DtypeWarning)
In [8]:
def ding(title="Ding!", message="Task completed"):
    """
    this method only works on linux
    """
    for i in range(2):
        !notify-send '{title}' '{message}'

Feature Toggles¶

Let's also create a simple feature toggle that we can use to skip expensive operations during notebook work (to save some time!)

Set it to True if you want to run absolutely everything. Set it to False to skip optional steps/exploratory work.

In [9]:
def run_entire_notebook(filename: str = None):
    run_all = False
    if not run_all:
        print("skipping optional operation")
        fullpath = f"cached/printouts/{filename}.txt"
        if filename is not None and os.path.exists(fullpath):
            print("==== 🗃️ printing cached output ====")
            with open(fullpath) as f:
                print(f.read())
    return run_all

We will also create a small settings file to store all of the tweaks/configs we might need in production (train/fit/deployed model). This guarantees that we rely on this file instead of having to remember values and risking a misconfiguration.

In [10]:
prod_settings = ProdSettings("prod/settings.json")

Business Context¶

Our business wants to help lenders by providing risk evaluation services.

We can leverage our skills as Data Scientists to build various models and evaluate their performance.

Measuring performance¶

We know that this business domain (banking/loans/risk management) generally deals with imbalanced datasets, so some metrics are immediately discarded (accuracy is the first to go out the window!).
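To make the point concrete, here is a toy sketch (the 12-to-1 ratio is an illustrative assumption at this point, not a number computed from the dataset): a "model" that blindly predicts the majority class looks impressive on accuracy while catching zero struggling borrowers.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# hypothetical 12-to-1 imbalanced labels: 1200 repayers (0), 100 strugglers (1)
y_true = np.array([0] * 1200 + [1] * 100)
y_pred = np.zeros_like(y_true)  # a lazy model that always predicts "no struggle"

acc = accuracy_score(y_true, y_pred)  # ~0.92: looks great on paper
rec = recall_score(y_true, y_pred)    # 0.0: identifies no strugglers at all
```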

There is the temptation to use generic metrics that work "well enough" for imbalanced datasets, but for this project we want to take a step back and look at the bigger picture:

We want to measure our success in the same way that our clients measure it.

That means that we should be familiar with their criteria for success.

We should use a custom metric. The reasoning is that it is unlikely they measure success "with an f1_score() function".

If our client were a regular bank, they would be more likely to measure success based on overall cash balance (are we making enough money? are we being too risky with our bets and losing out? are we being too conservative and leaving money on the table? what about opportunity costs?)

But that's not our client. Our client Home Credit Group is an international non-bank financial institution founded in 1997 which operates in 9 countries and focuses on installment lending primarily to people with little or no credit history.

Kaggle explains that Home Credit:

strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience.

If we want to be good at helping Home Credit Group, we need to start thinking like Home Credit Group!

Based on the description on Kaggle, it seems that they care less about being extremely highly profitable and more about actually helping people and uplifting communities.

We should use a metric that reflects these values. Our task as Data Scientists is to encode these values and tell the predictive models to measure success like our clients measure it, so that when we optimize and tune them, the models take these guiding principles into account.

Choosing the right approach¶

Looking at the data, it's pretty obvious that this project is a great place to use machine learning to make predictions on the loans.

  • Machine Learning (too complex to code logic ourselves)
  • Supervised Learning (we have a target column we can use to train our models)
  • Classification (discrete output label)
  • Binary Classification (only 1 output with 2 possible values)
  • Offline training

Choosing the right metric¶

There is a temptation to think that we can just brute force this. For example, we could think: "If we use various scoring metrics and then train various models using each of these metrics, we can see which one gets better scores and pick that metric and model. The slides will be fantastic!"... However, this is a trap.

We will try to explain why, using an analogy:

Suppose we are creating a "Personal Shopper bot" that will help clients update their wardrobe. A generic approach might be to choose clothes that are in fashion and high-quality. However, this might not align with the client's personal style or comfort. Maybe they are looking for comfy stay-at-home clothes!

It would be incorrect to create various personal shopper bots, train each of them on different criteria (pick only fashionable clothes, pick only high-quality clothes), and then compare the bots according to each of those specific functions. These scores tell us nothing about how things will work in the real world (what do actual customers need and value?). They only tell us how well each model fits the data according to that function; it does not mean that real customers are actually guided by any of these functions.

This is what such a situation could look like:

I hope this picture illustrates why Model 5 is the correct model to use: it's the one with the highest score, using the criteria that our customer uses.

How to measure performance¶

For this reason, we will need to make a judgement call right now, as part of this project's storytelling: We need to find out (= "decide") what our company values.

We want to make as many correct guesses as possible, but we also want to recognize that some errors are worse than others...

That is:

                           predicted 0 - "no struggle"   predicted 1 - "struggle"
actual 0 - "no struggle"   ➕ ✅ GOOD                     ➖ ❌ NOT GREAT
actual 1 - "struggle"      ➖ ❌ TERRIBLE                 ➕ ✅ GOOD

False negatives (predict "no struggle" but the customer did struggle to repay) are worse than the opposite (False Positives).

We are using this terminology to match the criteria used in the TARGET column:

  • 0 = no difficulties
  • 1 = had difficulties

While we want to use different weights for the errors, we want to make sure that we don't accidentally make FN infinitely worse than FP, as this would result in a model that leans too hard in that direction.

We want to use the right weights as a tool to help us improve our model so it scores best (as many correct guesses as possible).

So:

  • the actual measurement = how many are correctly guessed.
  • the tool we can use to tweak our model = the weights for correct/incorrect guesses
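As a sketch of how those two ideas can be encoded, here is a hypothetical cost-weighted metric (the 5-to-1 weighting is an illustrative assumption, not a figure agreed with the business) wrapped for scikit-learn's tuning tools:

```python
from sklearn.metrics import confusion_matrix, make_scorer

# Illustrative assumption: a missed struggler (FN) costs 5x a false alarm (FP).
# Finite weights keep FN from becoming "infinitely worse" than FP.
FN_COST, FP_COST = 5.0, 1.0

def weighted_error_cost(y_true, y_pred):
    """Total cost of the mistakes, with FN weighted more heavily than FP."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fn * FN_COST + fp * FP_COST

# scikit-learn maximizes scores, so flip the sign of the cost
cost_scorer = make_scorer(weighted_error_cost, greater_is_better=False)
```

cost_scorer could then be passed as scoring= to GridSearchCV so the tuning itself optimizes for our weighted notion of success.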

What is our threshold to "go live"?¶

Because this is a serious enterprise, we don't want to deploy a model that performs poorly. But we also want to recognize that this is iteration 1.0 for this new finance division, so we will set a goal that is ambitious but also realistic.

  • We want to make correct predictions most of the time
    • Accuracy >= 80%
  • We want to correctly identify >= 60% of the people who will struggle to pay so we can build better suited plans that accommodate their needs.
    • TPR - Recall/Sensitivity >= 80%
  • We want our model to not be too skewed towards predicting "struggle" so we don't spend too much money on additional secondary checks (which our operations team would have to do manually)
    • TNR >= 50%
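These criteria can be sketched as a simple go/no-go gate (a hedged helper; the default thresholds mirror the metric bullets above and can be renegotiated with the business):

```python
from sklearn.metrics import confusion_matrix

def go_live_check(y_true, y_pred, min_accuracy=0.80, min_tpr=0.80, min_tnr=0.50):
    """Return True if predictions clear our go-live thresholds (sketch)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)  # recall/sensitivity on the "struggle" class
    tnr = tn / (tn + fp)  # specificity on the "no struggle" class
    return accuracy >= min_accuracy and tpr >= min_tpr and tnr >= min_tnr
```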

PS: Spoiler - We don't know how this project will go, but since it's a learning project and it requires a deployment to Google Cloud, we will ultimately deploy the best models we can make, even if they do not reach our target. But in a real business environment, this is the stage where we would engage the business decision-makers and decide the go/no-go cut-off, before we even begin the technical work.

Now that we have defined our criteria for success, we can finally get started!

Data Acquisition¶

Downloading the dataset¶

💡 If you don't have credentials for kaggle, you can just download the dataset from the links above, in the Context-Requirements section

In [11]:
competition_name = "home-credit-default-risk"
db_filename = "HomeCredit_columns_description.csv"

auto_kaggle.download_dataset(
    competition_name, db_filename, timeout_seconds=120, is_competition=True
)
__
Kaggle API 1.5.13 - login as 'edualmas'
File [dataset/HomeCredit_columns_description.csv] already exists locally!
No need to re-download data for competition [home-credit-default-risk]

Understanding the dataset¶

We want to get a general understanding of the data that we have been provided:

  1. [x] Get an understanding of the data we have

    1. [x] How is the train/test data split?
    2. [x] How are the related tables linked to application_* table?
    3. [x] Ensure there is no confidential/sensitive data in this dataset (done! this dataset was pretty clean and massaged)
  2. [x] Format data so we can easily retrieve it later

    1. [x] Drop clearly unneeded data
    2. [x] Find a suitable file format
    3. [x] Separate test data from train data to ensure we don't accidentally use/see it.

Let's rename and standardize the names so they all use lowercase snake_case across all datasets and fields

In [12]:
if run_entire_notebook():
    !mv dataset/POS_CASH_balance.csv dataset/pos_cash_balance.csv
skipping optional operation
In [13]:
all_csvs = {
    "bureau_balance.csv": "ascii",
    "credit_card_balance.csv": "ascii",
    "installments_payments.csv": "ascii",
    "previous_application.csv": "ascii",
    "application_train.csv": "ascii",
    "application_test.csv": "ascii",
    "bureau.csv": "ascii",
    "pos_cash_balance.csv": "ascii",
}
In [14]:
if run_entire_notebook("lower_case_field_names"):
    with fs_utils.in_subdir("dataset"):
        for csvfile in all_csvs:
            print("converting headers to lower case:", csvfile)
            output_filename = "lower_" + csvfile
            # lower-case only the header row, then replace the original file
            command = f"awk 'NR==1 {{print tolower($0)}} NR!=1 {{print $0}}' {csvfile} > {output_filename}"
            subprocess.run(command, shell=True, check=True)
            !mv $output_filename $csvfile
skipping optional operation
==== 🗃️ printing cached output ====
converting headers to lower case: bureau_balance.csv
converting headers to lower case: credit_card_balance.csv
converting headers to lower case: installments_payments.csv
converting headers to lower case: previous_application.csv
converting headers to lower case: application_train.csv
converting headers to lower case: application_test.csv
converting headers to lower case: bureau.csv
converting headers to lower case: pos_cash_balance.csv

Let's import all of the data to a sqlite database, so we can query it easily

In [15]:
def import_all_csvs_into_sqlite():
    if os.path.exists("dataset/db.sqlite"):
        print("deleting existing file: dataset/db.sqlite")
        !rm dataset/db.sqlite
    for csv in map(lambda x: f"dataset/{x}", all_csvs.keys()):
        print(f"importing {csv}")
        !csvs-to-sqlite $csv dataset/db.sqlite


if run_entire_notebook("merge_csvs"):
    import_all_csvs_into_sqlite()
skipping optional operation
==== 🗃️ printing cached output ====
importing dataset/application_test.csv
Loaded 1 dataframes
Created dataset/db.sqlite from 1 CSV file
importing dataset/bureau_balance.csv
Loaded 1 dataframes
Added 1 CSV file to dataset/db.sqlite
importing dataset/credit_card_balance.csv
Loaded 1 dataframes
Added 1 CSV file to dataset/db.sqlite
importing dataset/installments_payments.csv
Loaded 1 dataframes
Added 1 CSV file to dataset/db.sqlite
importing dataset/previous_application.csv
Loaded 1 dataframes
Added 1 CSV file to dataset/db.sqlite
importing dataset/application_train.csv
Loaded 1 dataframes
Added 1 CSV file to dataset/db.sqlite
importing dataset/bureau.csv
Loaded 1 dataframes
Added 1 CSV file to dataset/db.sqlite
importing dataset/POS_CASH_balance.csv
Loaded 1 dataframes
Added 1 CSV file to dataset/db.sqlite

ERD diagram¶

This is the ERD diagram that we can see in the official documentation.

It seems that the 2 main tables are application_train and application_test, and that these two act as the central tables in a pseudo "star schema"/snowflake (depending on how picky we want to get with the definitions).

A few of the key assumptions/understandings we have right now are:

  1. Each of the 2 application_* tables is uniquely bound to the other tables, with no overlapping data/attributes. That is: a specific application (and by extension the rows in other tables that reference it) belongs either to train or to test, never to both.
  2. Similarly, if there's any data that cannot be linked to a specific train/test application, it can be dropped (should we find any!)
  3. The last important bit is not an assumption, but a fact straight from the dataset's documentation:
    • the application_test table does not have a TARGET column (In kaggle competitions, the actual test split is validated by the kaggle server)

From the docs:

  • application_{train|test}.csv
    • This is the main table, broken into two files for Train (with TARGET) and Test (without TARGET).
    • Static data for all applications. One row represents one loan in our data sample.
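To make the linkage concrete, here is a hedged sketch using tiny in-memory stand-ins for three of the tables (schemas trimmed to the key columns; values are made up): satellite tables join to application_* on sk_id_curr, and bureau_balance joins to bureau on sk_id_bureau.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
# trimmed-down stand-ins for the real tables, key columns only
conn.execute("CREATE TABLE application_train (sk_id_curr INT, target INT)")
conn.execute("CREATE TABLE bureau (sk_id_bureau INT, sk_id_curr INT)")
conn.execute("CREATE TABLE bureau_balance (sk_id_bureau INT, months_balance INT)")
conn.execute("INSERT INTO application_train VALUES (100001, 0)")
conn.execute("INSERT INTO bureau VALUES (5001, 100001)")
conn.execute("INSERT INTO bureau_balance VALUES (5001, -3)")

# walk the star: application -> bureau -> bureau_balance
q = """
SELECT a.sk_id_curr, a.target, bb.months_balance
FROM application_train a
JOIN bureau b          ON b.sk_id_curr    = a.sk_id_curr
JOIN bureau_balance bb ON bb.sk_id_bureau = b.sk_id_bureau;
"""
linked = pd.read_sql(q, conn)
```

The same joins apply against the real dataset/db.sqlite, just over the full tables.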

We will validate our assumptions soon

Making sure we imported all the data¶

In [16]:
@cached_dataframe()
def eda_check_all_tables_have_data():
    all_tables = [
        "application_train",
        "application_test",
        "bureau",
        "bureau_balance",
        "credit_card_balance",
        "installments_payments",
        "previous_application",
        "pos_cash_balance",
    ]
    with sql_utils.AutoClosing("dataset/db.sqlite") as original_dataset_conn:
        summary = {}
        for table in all_tables:
            c = original_dataset_conn.cursor()
            q = f"SELECT count(*) FROM {table}"
            c.execute(q)
            rowcount = c.fetchone()[0]
            summary[table] = rowcount
        return pd.DataFrame(summary.values(), index=summary.keys(), columns=["row_count"])


rowcounts = eda_check_all_tables_have_data()
number_of_tables_with_0_rows = rowcounts[rowcounts == 0].count()
assert number_of_tables_with_0_rows.sum() == 0
display(rowcounts)
Loading from cache [./cached/df/eda_check_all_tables_have_data.parquet]
row_count
application_train 307511
application_test 48744
bureau 1716428
bureau_balance 27299925
credit_card_balance 3840312
installments_payments 13605401
previous_application 1670214
POS_CASH_balance 10001358

Checking the dataset (im)balance¶

Considering this is a capstone project, we already know that the data will be imbalanced.

The question is "how terrible" will the imbalance be?

In [17]:
@run
@cached_with_pickle()
def eda_query_target_balance():
    with sql_utils.AutoClosing("dataset/db.sqlite") as original_dataset_conn:
        original_dataset_conn = sql_utils.Sql(original_dataset_conn)
        q = "SELECT target, count(*) FROM application_train GROUP BY target"
        return original_dataset_conn.peek("", q, limit=15)
Loading from cache [./cached/pickle/eda_query_target_balance.pickle]
TARGET count(*)
0 0 282686
1 1 24825

⚠️ This dataset is quite imbalanced!

We must keep this imbalance in mind (~ 12-to-1) during the rest of the steps.

2-to-1 would be noteworthy, >10-to-1 is going to be a challenge to handle delicately.

Checking the kaggle train/test split¶

The last thing we wanted to check was the first assumption we made: "there is no overlap between train/test IDs".

Since these IDs are used to relate info across various tables, we expect 0 rows to show up below. If anything shows up, we have a problem with a potentially corrupted dataset or a malformed train/test split:

In [18]:
@cached_with_pickle()
def eda_query_train_test_overlap():
    with sql_utils.AutoClosing("dataset/db.sqlite") as original_dataset_conn:
        original_dataset_conn = sql_utils.Sql(original_dataset_conn)
        q = (
            "SELECT SK_ID_CURR FROM application_train"
            " INTERSECT "
            "SELECT SK_ID_CURR FROM application_test"
        )
        return original_dataset_conn.peek("", q, limit=15)


data_leakage_rowcount = eda_query_train_test_overlap()
display(data_leakage_rowcount)
print("We expect this table to be empty (0 rows)", end=": ")
util.check(data_leakage_rowcount.shape[0] == 0)
Loading from cache [./cached/pickle/eda_query_train_test_overlap.pickle]
SK_ID_CURR
We expect this table to be empty (0 rows): ✅
In [19]:
def check_no_data_overlaps(conn, tables=None):
    if not tables:
        tables = ["training", "val", "test", "hyper"]
    dfcols = {}
    for t1 in tables:
        results = []
        for t2 in tables:
            sql = f"""
            SELECT count(*)
            FROM (
                SELECT SK_ID_CURR
                FROM APPLICATION_{t1}
            INTERSECT
                SELECT SK_ID_CURR
                FROM APPLICATION_{t2}
            );
            """
            cursor = conn.cursor()
            cursor.execute(sql)
            result = cursor.fetchone()[0]
            results.append(result)
        dfcols[t1] = results

    return pd.DataFrame(dfcols, index=tables)
In [20]:
@cached_with_pickle()
def eda_data_overlap_original_dataset():
    with sql_utils.AutoClosing("dataset/db.sqlite") as original_dataset_conn:
        return check_no_data_overlaps(original_dataset_conn, ["TRAIN", "TEST"])


display(eda_data_overlap_original_dataset())
Loading from cache [./cached/pickle/eda_data_overlap_original_dataset.pickle]
TRAIN TEST
TRAIN 307511 0
TEST 0 48744

Let's remember this 307511 because we will need it later to make sure no data got lost!

It seems that there is no overlap between the TRAIN and the TEST datasets.

This means that if we split the data using SK_ID_CURR from the right table, we are guaranteed to not have any piece of data out of place.

Dropping unnecessary data¶

This dataset contains 2 distinct Application tables: application_train and application_test.

Since this data comes from a Kaggle Competition, the test split does not contain any TARGET column, which means we cannot actually use it for testing how our model performs on unseen data.

Normally, it'd be fun to try to submit our predictions to kaggle to see how we did. However, even this is not available to us. Since the competition is from a few years ago, it's already closed and it does not allow new predictions to be submitted/scored.

This means that instead of using the kaggle-provided split for our train/test, we will likely have to drop the test partition, and create sub-datasets from the train ourselves.

Deleting kaggle-test data¶

In [21]:
def generate_delete_statement(table_name, valid_tablename):
    if table_name == "bureau_balance":
        return """DELETE FROM bureau_balance
                  WHERE bureau_balance.SK_ID_BUREAU NOT IN (
                    SELECT bureau.SK_ID_BUREAU FROM bureau
                  );"""
    else:
        return f"""DELETE FROM {table_name}
                   WHERE SK_ID_CURR NOT IN (SELECT SK_ID_CURR FROM {valid_tablename});"""
In [22]:
def drop_data(conn: sqlite3.Connection, split_name: str) -> None:
    valid_tablename = f"application_{split_name}"
    secondary_tables = [
        "bureau",
        "bureau_balance",
        "credit_card_balance",
        "installments_payments",
        "previous_application",
        "pos_cash_balance",
    ]
    sql_statements = {
        table: generate_delete_statement(table, valid_tablename)
        for table in secondary_tables
    }

    for table, sql in sql_statements.items():
        print(f"Deleting unnecessary data from... {table}: ", end="")
        cursor = conn.cursor()
        cursor.execute(sql)
        rows = cursor.rowcount
        print(f"{rows:,} rows deleted")

    all_partitions = [
        "application_hyper",
        "application_val",
        "application_test",
        "application_training",
    ]
    for tb in set(all_partitions) - {valid_tablename}:
        q = f"DROP TABLE IF EXISTS {tb};"
        print(q)
        cursor.execute(q)

    conn.commit()
    print("vacuuming db")
    conn.execute("VACUUM")
    print("vacuum completed")
In [23]:
def create_db(split_name: str) -> None:
    print(f"Moving all data for {split_name} into its own table")
    filename = f"dataset/split_{split_name}.sqlite"
    !cp dataset/db.sqlite $filename
    with sql_utils.AutoClosing(filename) as conn:
        drop_data(conn, split_name)
        print(f"Completed. The database [{filename}] only contains {split_name} data")
In [24]:
if run_entire_notebook("delete_test_split"):
    create_db("train")
    !rm "dataset/db.sqlite"
    !mv "dataset/split_train.sqlite" "dataset/db.sqlite"
skipping optional operation
==== 🗃️ printing cached output ====
Moving all data for train into its own table
Deleting unnecessary data from... bureau: 251,103 rows deleted
Deleting unnecessary data from... bureau_balance: 12,598,313 rows deleted
Deleting unnecessary data from... credit_card_balance: 612,347 rows deleted
Deleting unnecessary data from... installments_payments: 0 rows deleted
Deleting unnecessary data from... previous_application: 256,513 rows deleted
Deleting unnecessary data from... POS_CASH_balance: 1,457,983 rows deleted
DROP TABLE IF EXISTS application_hyper;
DROP TABLE IF EXISTS application_val;
DROP TABLE IF EXISTS application_training;
DROP TABLE IF EXISTS application_test;
vacuuming db
vacuum completed
Completed. The database [dataset/split_train.sqlite] only contains train data

Creating new data splits¶

Now that we have removed all the data that we were not going to be able to use, we need to create new data splits.

Not only do we need new train and test splits, we will also need a couple more independent datasets for this project:

  • a train dataset: for most of the model training
  • a hyperparam dataset: for hyperparameter tuning
  • a val dataset: a validation set to perform validations on unseen data
  • a test dataset: a final test split to make the final deploy/kill decision for the selected model.
    • Psst! Here's a secret: since this is a learning project, and we HAVE TO deploy something to Google Cloud, we will deploy whatever model(s) we end up with. But in a real-world scenario, this would be a 👍/👎 roman-emperor decision about the go-live of the model.

We will use a custom function to create these 4 splits from the data that was originally marked as training.
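split_utils.split_dataset is a project-specific helper; for intuition, a stratified 70/10/10/10 split could be sketched by chaining scikit-learn's train_test_split (a hypothetical equivalent, not the exact implementation used here). Chaining guarantees each row lands in exactly one split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def four_way_split(df, target_col, seed=100):
    """Stratified 70/10/10/10 split by chaining train_test_split."""
    y = df[target_col]
    X = df.drop(columns=[target_col])
    # peel off 70% for training, keep 30% to subdivide
    X_tr, X_rem, y_tr, y_rem = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    # split the remaining 30% into three equal 10% chunks
    X_hyper, X_rest, y_hyper, y_rest = train_test_split(
        X_rem, y_rem, test_size=2 / 3, stratify=y_rem, random_state=seed
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed
    )
    return {
        "training": (X_tr, y_tr),
        "hyper": (X_hyper, y_hyper),
        "val": (X_val, y_val),
        "test": (X_test, y_test),
    }

# demo on a toy imbalanced frame
df = pd.DataFrame({"x": range(1000), "TARGET": [1] * 100 + [0] * 900})
splits = four_way_split(df, "TARGET")
```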

Let's create some utility functions that we will need

In [25]:
def check_no_data_got_lost(conn):
    cte = """
    WITH new_tables AS ( SELECT * FROM application_training
                         UNION ALL
                         SELECT * FROM application_hyper
                         UNION ALL
                         SELECT * FROM application_val
                         UNION ALL
                         SELECT * FROM application_test ),
     original_table AS ( SELECT * FROM application_original )
    """

    check_1 = f"""{cte} SELECT * FROM new_tables EXCEPT SELECT * FROM original_table;"""
    check_2 = f"""{cte} SELECT * FROM original_table EXCEPT SELECT * FROM new_tables;"""

    cursor = conn.cursor()
    cursor.execute(check_1)
    result1 = cursor.fetchone()
    cursor.execute(check_2)
    result2 = cursor.fetchone()
    if result1:
        print(
            "the new tables have more data than the original table (check for possible data corruption)"
        )

    if result2:
        print(
            "the new tables have less data than the original table (some data has been lost)"
        )

    return not (result1 or result2)
In [26]:
def prepare_db_for_split(cursor):
    cursor.execute("DROP TABLE IF EXISTS application_training;")
    cursor.execute("DROP TABLE IF EXISTS application_hyper;")
    cursor.execute("DROP TABLE IF EXISTS application_val;")
    cursor.execute("DROP TABLE IF EXISTS application_test;")

    q = "SELECT name FROM sqlite_master WHERE type='table' AND name='application_train';"
    cursor.execute(q)
    print(q)
    original_exists = cursor.fetchone()
    print("original_exists:", original_exists)
    if original_exists:
        q = "ALTER TABLE application_train RENAME TO application_original;"
        print(q)
        cursor.execute(q)

Then we can split our data into 4 parts, ensuring that the data is randomly shuffled and that no row is accidentally reused or left out. Every row will end up in exactly 1 split.

In [27]:
if run_entire_notebook("split_into_4_datasets"):
    with sql_utils.AutoClosing("dataset/db.sqlite") as original_db_conn:
        cursor = original_db_conn.cursor()
        prepare_db_for_split(cursor)

        tr = pd.read_sql("SELECT * FROM application_original;", original_db_conn)

        splits = split_utils.split_dataset(
            tr, "TARGET", {"training": 0.7, "hyper": 0.1, "val": 0.1, "test": 0.1}
        )

        for name, (_x, _y) in splits.items():
            print(f"storing {name} split. {_x.shape[0]} rows")
            _df = _x.copy()
            _df["TARGET"] = _y

            cols = list(_df.columns)
            cols = cols[:-1]
            cols.insert(1, _df.columns[-1])
            _df = _df[cols]

            _df.to_sql(f"application_{name}", original_db_conn, index=False)

        if check_no_data_got_lost(original_db_conn):
            cursor.execute("DROP TABLE IF EXISTS application_original;")
            print("all data has been accounted for ✅")
        else:
            print(
                "something failed. skipping deleting original data so it can be used for inspection"
            )
skipping optional operation
==== 🗃️ printing cached output ====
SELECT name FROM sqlite_master WHERE type='table' AND name='application_train';
original_exists: ('application_train',)
ALTER TABLE application_train RENAME TO application_original;
storing training split. 215258 rows
storing hyper split. 30751 rows
storing val split. 30751 rows
storing test split. 30751 rows
all data has been accounted for ✅

Alright! Things look good.

Let's compare the data with the original dataset, to make sure that we didn't lose anything. Recall that this is what the data looked like:

Let's compare it to the 4-way split we just created from the TRAIN data (we dropped the original TEST data earlier because, without TARGET labels, it was unusable to us)

In [28]:
@cached_dataframe()
def eda_check_no_overlap():
    with sql_utils.AutoClosing("dataset/db.sqlite") as original_db_conn:
        return check_no_data_overlaps(original_db_conn)


display(eda_check_no_overlap())
Loading from cache [./cached/df/eda_check_no_overlap.parquet]
training val test hyper
training 215258 0 0 0
val 0 30751 0 0
test 0 0 30751 0
hyper 0 0 0 30751
In [29]:
train_data_from_original_dataset = eda_data_overlap_original_dataset()["TRAIN"].sum()
all_data_after_split = eda_check_no_overlap().sum().sum()

util.check(train_data_from_original_dataset == all_data_after_split)
Loading from cache [./cached/pickle/eda_data_overlap_original_dataset.pickle]
Loading from cache [./cached/df/eda_check_no_overlap.parquet]
✅

Creating 4 separate DBs¶

Now that we have all the data split into 4 tables, we can create separate files so we can just load the data that we need (and avoid peeking in the wrong place)

In [30]:
if run_entire_notebook("split_each_dataset_to_its_own_db"):
    create_db("hyper")
    create_db("val")
    create_db("test")
    create_db("training")
skipping optional operation
==== 🗃️ printing cached output ====
Moving all data for hyper into its own table
Deleting unnecessary data from... bureau: 1,319,062 rows deleted
Deleting unnecessary data from... bureau_balance: 13,226,798 rows deleted
Deleting unnecessary data from... credit_card_balance: 2,905,033 rows deleted
Deleting unnecessary data from... installments_payments: 0 rows deleted
Deleting unnecessary data from... previous_application: 1,272,436 rows deleted
Deleting unnecessary data from... POS_CASH_balance: 7,690,785 rows deleted
DROP TABLE IF EXISTS application_val;
DROP TABLE IF EXISTS application_training;
DROP TABLE IF EXISTS application_test;
vacuuming db
vacuum completed
Completed. The database [dataset/split_hyper.sqlite] only contains hyper data
Moving all data for val into its own table
Deleting unnecessary data from... bureau: 1,317,865 rows deleted
Deleting unnecessary data from... bureau_balance: 13,201,811 rows deleted
Deleting unnecessary data from... credit_card_balance: 2,907,762 rows deleted
Deleting unnecessary data from... installments_payments: 0 rows deleted
Deleting unnecessary data from... previous_application: 1,271,947 rows deleted
Deleting unnecessary data from... POS_CASH_balance: 7,682,249 rows deleted
DROP TABLE IF EXISTS application_hyper;
DROP TABLE IF EXISTS application_training;
DROP TABLE IF EXISTS application_test;
vacuuming db
vacuum completed
Completed. The database [dataset/split_val.sqlite] only contains val data
Moving all data for test into its own table
Deleting unnecessary data from... bureau: 1,318,854 rows deleted
Deleting unnecessary data from... bureau_balance: 13,230,887 rows deleted
Deleting unnecessary data from... credit_card_balance: 2,911,086 rows deleted
Deleting unnecessary data from... installments_payments: 0 rows deleted
Deleting unnecessary data from... previous_application: 1,274,191 rows deleted
Deleting unnecessary data from... POS_CASH_balance: 7,698,680 rows deleted
DROP TABLE IF EXISTS application_hyper;
DROP TABLE IF EXISTS application_val;
DROP TABLE IF EXISTS application_training;
vacuuming db
vacuum completed
Completed. The database [dataset/split_test.sqlite] only contains test data
Moving all data for training into its own table
Deleting unnecessary data from... bureau: 440,194 rows deleted
Deleting unnecessary data from... bureau_balance: 4,445,340 rows deleted
Deleting unnecessary data from... credit_card_balance: 960,014 rows deleted
Deleting unnecessary data from... installments_payments: 0 rows deleted
Deleting unnecessary data from... previous_application: 422,529 rows deleted
Deleting unnecessary data from... POS_CASH_balance: 2,558,411 rows deleted
DROP TABLE IF EXISTS application_hyper;
DROP TABLE IF EXISTS application_val;
DROP TABLE IF EXISTS application_test;
vacuuming db
vacuum completed
Completed. The database [dataset/split_training.sqlite] only contains training data
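
The `create_db` helper is project-specific and its internals aren't shown, but the core operation (carving one split's tables out into their own SQLite file) can be sketched with the stdlib alone, using `ATTACH DATABASE` to copy a table across files. All names below are illustrative:

```python
import os
import sqlite3
import tempfile

src_path = os.path.join(tempfile.mkdtemp(), "main.sqlite")
dst_path = os.path.join(tempfile.mkdtemp(), "split_hyper.sqlite")

# Build a toy source database with one split's table in it.
conn = sqlite3.connect(src_path)
conn.execute("CREATE TABLE application_hyper (sk_id_curr INTEGER, target INTEGER)")
conn.executemany("INSERT INTO application_hyper VALUES (?, ?)", [(1, 0), (2, 1)])
conn.commit()

# Attach the destination file and copy the split's rows across.
conn.execute("ATTACH DATABASE ? AS dst", (dst_path,))
conn.execute("CREATE TABLE dst.application_hyper AS SELECT * FROM main.application_hyper")
conn.commit()
conn.execute("DETACH DATABASE dst")
conn.close()

# The new file is now self-contained and can be opened on its own.
check = sqlite3.connect(dst_path)
rows = check.execute("SELECT COUNT(*) FROM application_hyper").fetchone()[0]
print(rows)  # 2
```

After copying, deleting rows that belong to other splits and running `VACUUM` (as the log above shows) reclaims the file space.
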

Now we can delete the intermediate tables.

The last thing we want to do is add some indexes to speed up our queries and joins:

In [31]:
if run_entire_notebook("create_indexes"):
    for split in ["training", "hyper", "val", "test"]:
        with sql_utils.AutoClosing(f"dataset/split_{split}.sqlite") as split_conn:
            c = sql_utils.Sql(split_conn)
            print("-------")
            c.create_index("main", f"application_{split}", "SK_ID_CURR")
            c.create_index("main", "bureau", "SK_ID_CURR")
            c.create_index("main", "bureau", "SK_ID_BUREAU")
            c.create_index("main", "bureau_balance", "SK_ID_BUREAU")
            c.create_index("main", "credit_card_balance", "SK_ID_CURR")
            c.create_index("main", "credit_card_balance", "SK_ID_PREV")
            c.create_index("main", "installments_payments", "SK_ID_CURR")
            c.create_index("main", "installments_payments", "SK_ID_PREV")
            c.create_index("main", "installments_payments", "num_instalment_number")
            c.create_index("main", "installments_payments", "num_instalment_version")

            c.create_index("main", "pos_cash_balance", "SK_ID_CURR")
            c.create_index("main", "pos_cash_balance", "SK_ID_PREV")
            c.create_index("main", "previous_application", "SK_ID_CURR")
            c.create_index("main", "previous_application", "SK_ID_PREV")
skipping optional operation
==== 🗃️ printing cached output ====
-------
creating index on main.application_training ['SK_ID_CURR']... done ✅
creating index on main.bureau ['SK_ID_CURR']... done ✅
creating index on main.bureau ['SK_ID_BUREAU']... done ✅
creating index on main.bureau_balance ['SK_ID_BUREAU']... done ✅
creating index on main.credit_card_balance ['SK_ID_CURR']... done ✅
creating index on main.credit_card_balance ['SK_ID_PREV']... done ✅
creating index on main.installments_payments ['SK_ID_CURR']... done ✅
creating index on main.installments_payments ['SK_ID_PREV']... done ✅
creating index on main.installments_payments ['num_instalment_number']... done ✅
creating index on main.installments_payments ['num_instalment_version']... done ✅
creating index on main.POS_CASH_balance ['SK_ID_CURR']... done ✅
creating index on main.POS_CASH_balance ['SK_ID_PREV']... done ✅
creating index on main.previous_application ['SK_ID_CURR']... done ✅
creating index on main.previous_application ['SK_ID_PREV']... done ✅
-------
creating index on main.application_hyper ['SK_ID_CURR']... done ✅
creating index on main.bureau ['SK_ID_CURR']... done ✅
creating index on main.bureau ['SK_ID_BUREAU']... done ✅
creating index on main.bureau_balance ['SK_ID_BUREAU']... done ✅
creating index on main.credit_card_balance ['SK_ID_CURR']... done ✅
creating index on main.credit_card_balance ['SK_ID_PREV']... done ✅
creating index on main.installments_payments ['SK_ID_CURR']... done ✅
creating index on main.installments_payments ['SK_ID_PREV']... done ✅
creating index on main.installments_payments ['num_instalment_number']... done ✅
creating index on main.installments_payments ['num_instalment_version']... done ✅
creating index on main.POS_CASH_balance ['SK_ID_CURR']... done ✅
creating index on main.POS_CASH_balance ['SK_ID_PREV']... done ✅
creating index on main.previous_application ['SK_ID_CURR']... done ✅
creating index on main.previous_application ['SK_ID_PREV']... done ✅
-------
creating index on main.application_val ['SK_ID_CURR']... done ✅
creating index on main.bureau ['SK_ID_CURR']... done ✅
creating index on main.bureau ['SK_ID_BUREAU']... done ✅
creating index on main.bureau_balance ['SK_ID_BUREAU']... done ✅
creating index on main.credit_card_balance ['SK_ID_CURR']... done ✅
creating index on main.credit_card_balance ['SK_ID_PREV']... done ✅
creating index on main.installments_payments ['SK_ID_CURR']... done ✅
creating index on main.installments_payments ['SK_ID_PREV']... done ✅
creating index on main.installments_payments ['num_instalment_number']... done ✅
creating index on main.installments_payments ['num_instalment_version']... done ✅
creating index on main.POS_CASH_balance ['SK_ID_CURR']... done ✅
creating index on main.POS_CASH_balance ['SK_ID_PREV']... done ✅
creating index on main.previous_application ['SK_ID_CURR']... done ✅
creating index on main.previous_application ['SK_ID_PREV']... done ✅
-------
creating index on main.application_test ['SK_ID_CURR']... done ✅
creating index on main.bureau ['SK_ID_CURR']... done ✅
creating index on main.bureau ['SK_ID_BUREAU']... done ✅
creating index on main.bureau_balance ['SK_ID_BUREAU']... done ✅
creating index on main.credit_card_balance ['SK_ID_CURR']... done ✅
creating index on main.credit_card_balance ['SK_ID_PREV']... done ✅
creating index on main.installments_payments ['SK_ID_CURR']... done ✅
creating index on main.installments_payments ['SK_ID_PREV']... done ✅
creating index on main.installments_payments ['num_instalment_number']... done ✅
creating index on main.installments_payments ['num_instalment_version']... done ✅
creating index on main.POS_CASH_balance ['SK_ID_CURR']... done ✅
creating index on main.POS_CASH_balance ['SK_ID_PREV']... done ✅
creating index on main.previous_application ['SK_ID_CURR']... done ✅
creating index on main.previous_application ['SK_ID_PREV']... done ✅
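
The `create_index` helper wraps plain `CREATE INDEX` statements. A minimal stdlib demonstration (with an illustrative table) that such an index is actually picked up by SQLite's query planner, visible via `EXPLAIN QUERY PLAN`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bureau (sk_id_curr INTEGER, amt REAL)")
conn.executemany(
    "INSERT INTO bureau VALUES (?, ?)", [(i % 100, float(i)) for i in range(1000)]
)

# Before the index: the planner must SCAN the whole table.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM bureau WHERE sk_id_curr = 7"
).fetchall()
print(plan)

# After the index: the same query becomes a SEARCH ... USING INDEX.
conn.execute("CREATE INDEX IF NOT EXISTS idx_bureau_sk_id_curr ON bureau (sk_id_curr)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM bureau WHERE sk_id_curr = 7"
).fetchall()
print(plan)
```

This is exactly why the join keys (`SK_ID_CURR`, `SK_ID_PREV`, `SK_ID_BUREAU`) are indexed in every split database: the feature-engineering joins later touch these columns constantly.
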

One final check, just to make sure that all our datasets are representative and similar:

In [32]:
if run_entire_notebook("split_check_imbalance_on_all_splits"):
    for split in ["training", "hyper", "val", "test"]:
        with sql_utils.AutoClosing(f"dataset/split_{split}.sqlite") as split_conn:
            c = sql_utils.Sql(split_conn)
            print(split)
            print(
                c.peek(
                    "",
                    f"""SELECT TARGET,
                       COUNT(*) AS count,
                       COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS percentage
                    FROM application_{split}
                    GROUP BY TARGET""",
                )
            )
skipping optional operation
==== 🗃️ printing cached output ====
training
   TARGET   count  percentage
0       0  197880   91.926897
1       1   17378    8.073103

hyper
   TARGET  count  percentage
0       0  28269   91.928718
1       1   2482    8.071282

val
   TARGET  count  percentage
0       0  28268   91.925466
1       1   2483    8.074534

test
   TARGET  count  percentage
0       0  28269   91.928718
1       1   2482    8.071282

Each of the 4 parts shows a similar class ratio (~8.07% positives). We are confident that the split was properly randomized and stratified.

Exploratory Data Analysis¶

Let's take a look at the data. We're particularly interested in:

  • finding interesting insights in the patterns that we can use to generate features
  • finding out which fields might give us trouble (NAs, too many categories to 1-hot-encode, ...)
  • finding out how the data is distributed in each of the "other" tables
  • finding how the values change depending on our target variable

How are we going to analyze our data?

  • Some of these analyses will require a raw subsample of our data (randomly shuffled, then sampled).
  • Other analyses (the ones that try to understand the differences between the TARGET classes) will require an artificially balanced sample. This ensures that we can see the forest for the trees, and that the roughly 12-to-1 class imbalance doesn't overwhelm the picture.

Which dataset are we going to use?

  • We will use a small-ish dataset. We are only interested in finding insights. Considering what we have seen (good shuffle, good stratification), we expect all 4 datasets to be similar and representative of each other. So we will pick one of the smaller ones, at our discretion, just to make the queries/analysis a bit faster/lighter.
  • We are not interested in seeing "totals", just comparisons and trends (does piece of data X help us predict? is the ratio similar between X and Y? etc...), so we will not be using any global counts/totals.
  • For this analysis, we will use the hyperparameter data split (hyper), which is 10% of the overall data and $1\over7$ the size of the training dataset.

Data Dictionary¶

From the data dictionary, we can extract:

Target variable

  • 1 = client with payment "difficulties" (data dict contains more specifics)
  • 0 = all other cases

In summary, 0 is the ideal result, and 1 indicates troubles.

Exploring our db tables¶

The application table is the primary table in our database. It contains the core features and the target variable.

All other tables have foreign keys (FK) referencing this table's primary key (PK).

In [33]:
eda_conn = sqlite3.connect("dataset/split_hyper.sqlite")
app_sample = pd.read_sql_query("SELECT * FROM application_hyper", eda_conn)
In [34]:
def balance(df: pd.DataFrame, target: str) -> pd.DataFrame:
    rus = RandomUnderSampler(random_state=42)
    X, y = rus.fit_resample(df.drop(columns=target), df[target])
    return pd.concat([X, y], axis=1)
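
`RandomUnderSampler` keeps every row of the minority class and randomly drops majority-class rows until both classes are the same size. A dependency-free sketch of the same idea (illustrative names, not the imblearn implementation):

```python
import random

def undersample(rows, label_of, seed=42):
    """Randomly downsample the majority class to the minority class's size."""
    rng = random.Random(seed)
    positives = [r for r in rows if label_of(r)]
    negatives = [r for r in rows if not label_of(r)]
    minority, majority = sorted([positives, negatives], key=len)
    kept = rng.sample(majority, len(minority))  # drop the rest of the majority
    return minority + kept

rows = [("row", i, i % 12 == 0) for i in range(1200)]  # 100 positives, 1100 negatives
balanced = undersample(rows, lambda r: r[2])
print(len(balanced))  # 200
```

Undersampling throws information away, which is why we only use the balanced frame for visual/statistical exploration, not for training.
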
In [35]:
app_sample_balanced = balance(app_sample, "TARGET")
In [36]:
app_sample.head()
Out[36]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
0 182534 0 Cash loans F N Y 0 67500.0 332842.5 12676.5 ... 0 0 0 0 0.0 0.0 1.0 0.0 0.0 1.0
1 226983 0 Cash loans M N N 1 135000.0 354276.0 25803.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0
2 176676 0 Cash loans F Y Y 0 265500.0 1762110.0 46480.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 1.0 2.0
3 266436 0 Cash loans M Y Y 0 90000.0 247500.0 16978.5 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0
4 205215 1 Cash loans M N Y 0 90000.0 417024.0 28341.0 ... 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 122 columns

In [37]:
app_sample["TARGET"].value_counts()
Out[37]:
0    28269
1     2482
Name: TARGET, dtype: int64
In [38]:
app_sample_balanced["TARGET"].value_counts()
Out[38]:
0    2482
1    2482
Name: TARGET, dtype: int64

Visualizing data¶

We want to take a quick look at our fields to understand which ones might be useful in helping predict the target variable.

We will be looking at the artificially balanced dataset (obviously). The 50% line is the baseline; anything closer to 0 indicates "better" (fewer difficulties) for our loan customers.

In [39]:
if run_entire_notebook("sweetviz_report"):
    sv_report = sv.analyze(
        app_sample_balanced, target_feat="TARGET", pairwise_analysis="off"
    )
    sv_report.show_html("cached/sweetviz_report.html")

display(IFrame("cached/sweetviz_report.html", width=1600, height=600))
skipping optional operation
==== 🗃️ printing cached output ====
Report cached/sweetviz_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.

A few things to highlight from this report:

Features that seem to noticeably influence TARGET:

  • EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3 - very strongly! (the higher, the better, for all 3 scores)
  • AMT_REQ_CREDIT_BUREAU_YEAR
  • The age of the person DAYS_BIRTH
  • Urban Areas seem to perform better than rural REGION_POPULATION_RELATIVE
  • The documents (FLAG_DOCUMENT_xx). When documents are provided, the risk decreases noticeably.
  • DAYS_LAST_PHONE_CHANGE - How long ago they changed their phone (it's unclear if they mean "changed the terminal" or "changed the phone number").
  • YEARS_BUILD_MEDI/MODE/AVG (And other fields about the building they live in)
  • At what time they came into the branch to start the process HOUR_APPR_PROCESS_START
    • I did not imagine this would be relevant, but it might be strongly linked to employment type (can only come in the evening, etc...)
  • AMT_ANNUITY
    • If they can afford large annuities, it's likely that they are able to have no issues (maybe: negotiate lower payments if they go through difficult periods, etc..)

This is a quick visual analysis. A bit further down, we will use statistical tools to try to validate these hints.

Understanding missing data to inform our feature selection¶

Probably the simplest way to understand our features is to use a Filter-Based Feature Selection Technique called Missing Value Ratio. This calculation lets us see which features are more likely to be useful or useless for our model.

We will use some libraries as well as some custom-built methods to understand how our data is spread.

Further down, we will see and use more advanced methodologies to help us further with our feature selection.
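
The Missing Value Ratio is simply the share of NA cells per column; features whose ratio exceeds a chosen threshold become candidates for dropping. A toy sketch (the pandas equivalent is `df.isna().mean()`; column names and the 0.4 threshold below are illustrative):

```python
def missing_value_ratio(columns):
    """Share of missing (None) cells per column, given a dict of lists."""
    return {
        name: sum(v is None for v in values) / len(values)
        for name, values in columns.items()
    }

toy = {
    "EXT_SOURCE_1": [0.5, None, None, 0.7],          # 50% missing
    "AMT_ANNUITY": [1000.0, 2000.0, None, 1500.0],   # 25% missing
    "DAYS_BIRTH": [-9000, -12000, -15000, -20000],   # complete
}
ratios = missing_value_ratio(toy)
to_drop = [c for c, r in ratios.items() if r > 0.4]
print(ratios, to_drop)
```
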

In [40]:
app_sample_balanced
Out[40]:
SK_ID_CURR NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE ... FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR TARGET
0 189708 Cash loans M N Y 0 135000.0 592560.0 31153.5 450000.0 ... 0 0 0 0.0 0.0 0.0 0.0 0.0 3.0 0
1 290477 Cash loans M N N 0 225000.0 315000.0 33075.0 315000.0 ... 0 0 0 0.0 1.0 0.0 1.0 0.0 3.0 0
2 333630 Cash loans F Y Y 0 148500.0 518562.0 36220.5 463500.0 ... 0 0 0 0.0 0.0 0.0 0.0 1.0 0.0 0
3 244267 Cash loans F N Y 1 157500.0 781920.0 34573.5 675000.0 ... 0 0 0 0.0 0.0 0.0 1.0 0.0 4.0 0
4 326845 Cash loans F N Y 0 76500.0 663669.0 22063.5 504000.0 ... 0 0 0 0.0 0.0 0.0 1.0 0.0 3.0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4959 313757 Cash loans M N Y 0 157500.0 125640.0 8626.5 90000.0 ... 0 0 0 NaN NaN NaN NaN NaN NaN 1
4960 281030 Cash loans F Y N 0 144000.0 1272888.0 37345.5 1111500.0 ... 0 0 0 NaN NaN NaN NaN NaN NaN 1
4961 340401 Cash loans M Y Y 0 90000.0 1390500.0 39042.0 1390500.0 ... 0 0 0 NaN NaN NaN NaN NaN NaN 1
4962 198003 Cash loans F N N 0 144000.0 1325475.0 56160.0 1125000.0 ... 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 1
4963 254465 Cash loans F N Y 0 135000.0 942300.0 27679.5 675000.0 ... 0 0 0 0.0 0.0 0.0 0.0 0.0 2.0 1

4964 rows × 122 columns

In [41]:
def plot_missingnos_comparison(
    df: pd.DataFrame, target_col: str, color=None, sort_first=True
):
    if color is None or len(color) != 3:
        color = (0.25, 0.25, 0.25)

    f, ax = plt.subplots(1, 1, figsize=(40, 6))
    target_zero = df[df[target_col] == 0]
    target_one = df[df[target_col] == 1]

    target_zero_nas = target_zero.isna().sum() > 0
    target_one_nas = target_one.isna().sum() > 0

    target_zero_nas = target_zero_nas[target_zero_nas > 0]
    target_one_nas = target_one_nas[target_one_nas > 0]

    cols_to_display = set(target_zero_nas.index).union(set(target_one_nas.index))
    target_zero = target_zero[cols_to_display]
    target_one = target_one[cols_to_display]

    target_zero.columns = target_zero.columns.map(lambda n: f"{n}_target_0")
    target_one.columns = target_one.columns.map(lambda n: f"{n}_target_1")

    if sort_first:
        zero_sorted = msno.nullity_sort(target_zero, sort="descending").reset_index(
            drop=True
        )
        one_sorted = msno.nullity_sort(target_one, sort="descending").reset_index(
            drop=True
        )
    else:
        zero_sorted = msno.nullity_sort(target_zero, sort=None).reset_index(drop=True)
        one_sorted = msno.nullity_sort(target_one, sort=None).reset_index(drop=True)

    merged = zero_sorted.join(one_sorted)
    sorted_cols = sorted(merged.columns)

    msno.matrix(merged[sorted_cols], labels=True, ax=ax, sparkline=False, color=color)

    chart_utils.rotate_x_labels(ax)

    return f.gca()
In [42]:
@run
@cached_chart()
def eda_missing_no_matrix_comparison_target():
    return plot_missingnos_comparison(
        app_sample_balanced, "TARGET", color=moonstone_rgb_n
    )
Loading from cache [./cached/charts/eda_missing_no_matrix_comparison_target.png]

Correlations across columns when they are missing data¶

The missingno library allows us to see that there are strong patterns across columns, where data is missing.

In [43]:
@run
@cached_chart()
def eda_missing_no_matrix_heatmap():
    f, ax = plt.subplots(1, 2, figsize=(60, 20))

    target_zero = app_sample_balanced[app_sample_balanced["TARGET"] == 0]
    target_one = app_sample_balanced[app_sample_balanced["TARGET"] == 1]

    target_zero_nas = target_zero.isna().sum() > 0
    target_one_nas = target_one.isna().sum() > 0

    target_zero_nas = target_zero_nas[target_zero_nas > 0]
    target_one_nas = target_one_nas[target_one_nas > 0]

    cols_to_display = set(target_zero_nas.index).union(set(target_one_nas.index))
    target_zero = target_zero[cols_to_display]
    target_one = target_one[cols_to_display]

    msno.heatmap(target_zero, labels=True, ax=ax[0])
    msno.heatmap(target_one, labels=True, ax=ax[1])
    ax[0].set_title("TARGET = 0 (no difficulties)")
    ax[1].set_title("TARGET = 1 (difficulties)")
    plt.tight_layout()
    return f
Loading from cache [./cached/charts/eda_missing_no_matrix_heatmap.png]
In [44]:
@run
@cached_chart()
def eda_missing_no_matrix_dendrogram():
    f, ax = plt.subplots(1, 2, figsize=(30, 15))

    target_zero = app_sample_balanced[app_sample_balanced["TARGET"] == 0]
    target_one = app_sample_balanced[app_sample_balanced["TARGET"] == 1]

    target_zero_nas = target_zero.isna().sum() > 0
    target_one_nas = target_one.isna().sum() > 0

    target_zero_nas = target_zero_nas[target_zero_nas > 0]
    target_one_nas = target_one_nas[target_one_nas > 0]

    cols_to_display = set(target_zero_nas.index).union(set(target_one_nas.index))
    target_zero = target_zero[cols_to_display]
    target_one = target_one[cols_to_display]

    msno.dendrogram(target_zero, ax=ax[0])
    msno.dendrogram(target_one, ax=ax[1])
    ax[0].set_title("TARGET = 0 (no difficulties)")
    ax[1].set_title("TARGET = 1 (difficulties)")
    plt.tight_layout()
    return f
Loading from cache [./cached/charts/eda_missing_no_matrix_dendrogram.png]

To keep the charts simpler, we are only including columns that are missing some data.

It seems that there is some pattern to the missing fields. The good news is that there seem to be very strong correlations in the missing values (the heatmap leans strongly towards +1), indicating that there are clear clusters of columns that behave similarly (missing data in one column can be used to confidently predict missing data in other columns).

The matrix and the dendrogram visualizations show the same patterns. I was looking for a clear way to distinguish and find differences between the two datasets. Unfortunately, they are pretty close.

It's hard to dig deeper without investing considerable time: considering that these charts come from the balanced dataset, the differences are only minor. For now, we will consider missing values as "interesting" but not distinct enough between the two target groups (target 0 and target 1).

The data in both splits seems to follow clear patterns and might even be "clusterable" somehow.

Something we could try in the future would be to cluster these loans with k-means (or a similar clustering algorithm) based on which groups of features are missing (eg. one dimension could be "missing data related to LIVINGAPARTMENTS_*", another dimension would be "missing data related to AMT_REQ_CREDIT_*").
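
The nullity heatmap above (from `msno.heatmap`) is the pairwise correlation of the "is missing" indicator vectors, i.e. the pandas equivalent of `df.isnull().corr()`. A dependency-free sketch of that nullity correlation for a pair of columns (assuming each column has at least one missing and one present value, so the variance is non-zero):

```python
import math

def nullity_corr(col_a, col_b):
    """Pearson correlation between the 'is missing' indicators of two columns."""
    a = [v is None for v in col_a]
    b = [v is None for v in col_b]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    return cov / math.sqrt(var_a * var_b)

# Two columns that are always missing together correlate at +1 ...
u = [1.0, None, 3.0, None]
v = [9.0, None, 7.0, None]
print(nullity_corr(u, v))  # 1.0
# ... and a column missing on exactly the opposite rows correlates at -1.
w = [None, 2.0, None, 4.0]
print(nullity_corr(u, w))  # -1.0
```

Values near +1 are what we see in our heatmap: knowing that one column of a cluster is missing lets us confidently predict that its siblings are missing too.
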

NA distribution in promising/interesting features¶

The following features show scattered/unpredictable NAs that don't follow the patterns of the rest of the columns:

  • OWN_CAR_AGE
  • OCCUPATION_TYPE
  • EXT_SOURCE_1

We will mark those in orange on the following chart.

We use a simple bootstrap to understand the distribution of NAs.

We're also interested in seeing how much variation there is between one sample and the next, so we will plot the bar chart with 95% CIs.
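
The chart itself comes from the project's `eda_utils.plot_nas_bootstrapped` helper, but the underlying idea is a plain bootstrap of the NA proportion: resample the column with replacement many times, and read the 95% CI off the percentiles of the resampled proportions. A stdlib sketch (toy column, illustrative parameter values):

```python
import random

def bootstrap_na_ci(values, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for the proportion of missing (None) values."""
    rng = random.Random(seed)
    n = len(values)
    props = sorted(
        sum(rng.choice(values) is None for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = props[int(n_boot * alpha / 2)]
    hi = props[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

col = [None] * 30 + [1.0] * 70  # toy column with 30% missing
lo, hi = bootstrap_na_ci(col)
print(lo, hi)  # roughly (0.21, 0.39): the CI straddles the true 30%
```

A narrow CI on the real data tells us the NA rate is stable from sample to sample, rather than an artifact of which rows happened to be drawn.
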

In [45]:
@run
@cached_chart()
def eda_plot_nas_from_bootstrap():
    marks = {
        "grey": [
            "DAYS_BIRTH",
            "REGION_POPULATION_RELATIVE",
            "AMT_REQ_CREDIT_BUREAU_YEAR",
            "YEARS_BUILD_MEDI",
            "YEARS_BUILD_AVG",
            "YEARS_BUILD_MODE",
            "AMT_ANNUITY",
            "HOUR_APPR_PROCESS_START",
            "DAYS_LAST_PHONE_CHANGE",
        ],
        "lightgrey": [f"FLAG_DOCUMENT_{i}" for i in range(2, 22)],
        "orange": [
            "EXT_SOURCE_1",
            "EXT_SOURCE_2",
            "EXT_SOURCE_3",
            "OWN_CAR_AGE",
            "OCCUPATION_TYPE",
        ],
    }
    return eda_utils.plot_nas_bootstrapped(app_sample_balanced, marked_columns=marks)
Loading from cache [./cached/charts/eda_plot_nas_from_bootstrap.png]

It seems most of the fields that we identified are fully present. Only a handful of them are missing some data.

Some of the things we want to try will be only mildly affected by these NAs, so this represents just a small issue.

Mutual Information / Information Gain¶

Another simple way to understand our features is to use another type of Filter-Based Feature Selection Technique called Information Gain or Mutual Information.

Let's take a look at which fields might be helpful to us. Remember that mutual information does not capture interactions between fields; it scores each one on its individual contribution (its mutual dependence with the target).

In our case, we will score each of the fields against our target field.

Note that this custom mutual_info() function makes some generalizations, and its results are indicative only (eg. it auto-classifies features with "few" unique values as "discrete" rather than continuous, it drops rows with NAs, etc...). This chart is only an initial exploration, and it only includes contributions from individual features: it does not take feature interactions into account, so low scores are not necessarily an indication that we should drop a feature without further analysis.
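
For discrete variables, mutual information can be computed directly from its definition, MI = Σ p(x,y) · log( p(x,y) / (p(x)·p(y)) ). A stdlib sketch using natural logs (so values are in nats, as with scikit-learn's estimators; `eda_utils.mutual_info` is the project's more general wrapper and is not reproduced here):

```python
import math
from collections import Counter

def mutual_info(xs, ys):
    """MI between two discrete sequences, from empirical joint/marginal counts."""
    n = len(xs)
    p_xy = Counter(zip(xs, ys))
    p_x, p_y = Counter(xs), Counter(ys)
    return sum(
        (c / n) * math.log((c / n) / ((p_x[x] / n) * (p_y[y] / n)))
        for (x, y), c in p_xy.items()
    )

target = [0, 0, 1, 1, 0, 0, 1, 1]
copy_of_target = list(target)    # perfectly informative feature
coin = [0, 1, 0, 1, 0, 1, 0, 1]  # feature independent of the target
print(mutual_info(copy_of_target, target))  # ln(2) ≈ 0.693: removes all uncertainty
print(mutual_info(coin, target))            # 0.0: tells us nothing about the target
```

The perfectly informative feature scores at the target's full entropy (ln 2 for a balanced binary target), which is why MI values are best read relative to each other rather than against a fixed 0-1 scale.
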

In [46]:
@run
@cached_chart()
def eda_mutual_information():
    cols_to_encode = [
        "NAME_CONTRACT_TYPE",
        "CODE_GENDER",
        "FLAG_OWN_CAR",
        "FLAG_OWN_REALTY",
        "NAME_TYPE_SUITE",
        "NAME_INCOME_TYPE",
        "NAME_EDUCATION_TYPE",
        "NAME_FAMILY_STATUS",
        "NAME_HOUSING_TYPE",
        "OCCUPATION_TYPE",
        "WEEKDAY_APPR_PROCESS_START",
        "ORGANIZATION_TYPE",
        "FONDKAPREMONT_MODE",
        "HOUSETYPE_MODE",
        "WALLSMATERIAL_MODE",
        "EMERGENCYSTATE_MODE",
    ]

    f, ax = plt.subplots(1, 2, figsize=(15, 7), sharex=True)
    mi_u = eda_utils.mutual_info(
        app_sample,
        "TARGET",
        id_cols=["SK_ID_CURR"],
        is_classification=True,
        cols_to_encode=cols_to_encode,
    )
    mi_b = eda_utils.mutual_info(
        app_sample_balanced,
        "TARGET",
        id_cols=["SK_ID_CURR"],
        is_classification=True,
        cols_to_encode=cols_to_encode,
    )
    plt.title("mutual information analysis (top 25 fields)")
    ax[0].set_title("using original sample (imbalanced)")
    ax[1].set_title("using balanced sample")
    sns.barplot(y=mi_u.head(25).index, x=mi_u.head(25).values, color=moonstone, ax=ax[0])
    sns.barplot(y=mi_b.head(25).index, x=mi_b.head(25).values, color=moonstone, ax=ax[1])
    plt.tight_layout()
    return f
Loading from cache [./cached/charts/eda_mutual_information.png]

Remember that mutual information tells us how two features interact in terms of uncertainty: the mutual information (MI) between two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other. Also remember that it is not confined to a 0-1 scale (as correlation is to -1..1), so these values are more valuable as comparisons among features than as absolute scores within a fixed range.

And despite this, it seems that no feature has a score that stands far ahead of the rest.

This might indicate that no single feature dominates the search space, and that we will probably need complex models that can capture the interactions among the various features.

Removing Outliers and Visualizing Feature Distributions¶

We can see in some of the charts on the SweetViz report that a number of features are highly skewed, likely due to extreme outliers.

One of the key tasks we will want to perform is finding a strategy that helps us detect and remove outliers from any of the datasets.

We also want to visualize how our features are spread across the range of values.

In this section we want to:

  1. Build a cheatsheet for our dataset:
    • We want to visualize the data spread for all of our features, so we can quickly reference it later.
  2. Visualize which features are more affected by outliers.
    • We want to highlight, on each chart:
      • the height of the Y axis represents the spread of the original, unmodified data
      • the █ green highlight band represents the spread of the data that remains after removing outliers (not outliers for that column alone, but rows flagged as outliers across all features of the dataset).
      • Charts that are fully covered in green have not changed much before/after removing outliers.
      • Charts with a narrow green band have changed a lot: our outlier-detection algorithm (Isolation Forest) has dropped most of their extreme cases.

Categorical variables are excluded (for now) from this specific wave of Feature Selection.
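
The `clean.isolation_forest_outliers` helper used below presumably wraps a library implementation; the intuition behind isolation forests is worth seeing in miniature. A point is anomalous if random axis-aligned splits isolate it from the rest of the data in very few steps. A dependency-free, from-scratch sketch of that idea (illustrative, not the project's implementation):

```python
import random

def _isolation_depth(point, sample, depth, max_depth, rng):
    # Follow the branch containing `point` through one random isolation tree.
    if depth >= max_depth or len(sample) <= 1:
        return depth
    dim = rng.randrange(len(point))
    lo = min(r[dim] for r in sample)
    hi = max(r[dim] for r in sample)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    branch = [r for r in sample if (r[dim] < split) == (point[dim] < split)]
    return _isolation_depth(point, branch, depth + 1, max_depth, rng)

def isolation_scores(points, n_trees=50, sample_size=64, seed=0):
    """Mean isolation depth per point: lower = isolated sooner = more anomalous."""
    rng = random.Random(seed)
    return [
        sum(
            _isolation_depth(
                p, rng.sample(points, min(sample_size, len(points))), 0, 8, rng
            )
            for _ in range(n_trees)
        )
        / n_trees
        for p in points
    ]

gen = random.Random(1)
points = [(gen.gauss(0.0, 1.0),) for _ in range(64)] + [(100.0,)]  # one extreme row
scores = isolation_scores(points)
print(scores[-1], min(scores[:-1]))  # the outlier's mean depth sits far below every inlier's
```

Real implementations then convert depths into a normalized anomaly score and flag the most anomalous fraction of rows (the `contamination=0.05` used below), which is exactly the green-band cutoff in the charts.
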

In [47]:
def read_eda(tablename: str) -> pd.DataFrame:
    return auto_pandas.normalize_columns(
        pd.read_sql(f"SELECT * FROM {tablename};", eda_conn)
    )
In [48]:
def add_metadata_for_outliers(df: pd.DataFrame) -> pd.DataFrame:
    """
    adds metadata indicating the min and max values if we were to ignore outliers.
    The outliers are not removed, so they can still be plotted properly
    """
    outl = clean.isolation_forest_outliers(
        df.drop(
            columns=["sk_id_prev", "sk_id_bureau", "sk_id_curr", "pk"], errors="ignore"
        ),
        contamination=0.05,
        return_outliers_only=False,
    )
    inliers = outl[~outl["predicted_outlier"]]
    df.attrs["mins"] = inliers.min()
    df.attrs["maxs"] = inliers.max()
    return df
In [49]:
@cached_dataframes()
def eda_dfs_with_enriched_mixmax_after_removing_outliers():
    tables = {
        "application": read_eda("application_hyper"),
        "bureau": read_eda("bureau"),
        "bureau_balance": read_eda("bureau_balance"),
        "credit_card_balance": read_eda("credit_card_balance"),
        "installments_payments": read_eda("installments_payments"),
        "pos_cash_balance": read_eda("pos_cash_balance"),
        "previous_application": read_eda("previous_application"),
    }
    return {name: add_metadata_for_outliers(df) for name, df in tables.items()}
In [50]:
tables = eda_dfs_with_enriched_mixmax_after_removing_outliers()
Loading from cache [./cached/df_dict/eda_dfs_with_enriched_mixmax_after_removing_outliers.h5]
In [51]:
def plot_min_max(ax: plt.Axes, df: pd.DataFrame, colname: str) -> None:
    inliers_min = df.attrs["mins"][colname]
    inliers_max = df.attrs["maxs"][colname]
    ax.axhspan(inliers_min, inliers_max, color="green", alpha=0.08)
In [52]:
@run
@cached_chart()
def eda_dist_app():
    return chart_utils.plot_all_features(
        tables["application"], ax_callback=plot_min_max, color=moonstone
    )
Loading from cache [./cached/charts/eda_dist_app.png]
In [53]:
@run
@cached_chart()
def eda_dist_h_previous_application():
    return chart_utils.plot_all_features(
        tables["previous_application"], ax_callback=plot_min_max, color=moonstone
    )
Loading from cache [./cached/charts/eda_dist_h_previous_application.png]
In [54]:
@run
@cached_chart()
def eda_dist_bureau():
    return chart_utils.plot_all_features(
        tables["bureau"], ax_callback=plot_min_max, color=moonstone
    )
Loading from cache [./cached/charts/eda_dist_bureau.png]
In [55]:
@run
@cached_chart()
def eda_dist_bureau_balance():
    return chart_utils.plot_all_features(
        tables["bureau_balance"], ax_callback=plot_min_max, color=moonstone
    )
Loading from cache [./cached/charts/eda_dist_bureau_balance.png]
In [56]:
@run
@cached_chart()
def eda_dist_credit_balance():
    return chart_utils.plot_all_features(
        tables["credit_card_balance"],
        ax_callback=plot_min_max,
        color=moonstone,
    )
Loading from cache [./cached/charts/eda_dist_credit_balance.png]
In [57]:
@run
@cached_chart()
def eda_dist_installments_payments():
    return chart_utils.plot_all_features(
        tables["installments_payments"], ax_callback=plot_min_max, color=moonstone
    )
Loading from cache [./cached/charts/eda_dist_installments_payments.png]
In [58]:
@run
@cached_chart()
def eda_dist_pos_balance():
    return chart_utils.plot_all_features(
        tables["pos_cash_balance"], ax_callback=plot_min_max, color=moonstone
    )
Loading from cache [./cached/charts/eda_dist_pos_balance.png]

In these charts, we can see how the values are distributed for all of the fields in all of the datasets.

For numeric fields, we plot histograms and include a light-green band. This band spans the min and max values in the dataset AFTER removing outliers using Isolation Forests:

In the picture below we can see that some fields are severely affected, while others barely change.

This is because not all features exhibit the same behaviour/outliers. The current cutoff uses a 0.05 contamination threshold, but this parameter can be changed if we want to tweak it further. Missing values have been imputed with the column median.

This gives us an idea of which columns contribute most to a row being flagged as an outlier and which do not. We plan to use the same strategy when preparing our training dataset, to help our model learn the core patterns in our data.

Remember that outlier detection runs on the entire dataset, while the charts only sample 20% of the data (to speed up rendering), as this is only meant to give a general sense of viability.

We will do the actual removal on the training dataset at a later stage.
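To make the band computation concrete, here is a minimal self-contained sketch of the idea, assuming scikit-learn's IsolationForest (the notebook's clean.isolation_forest_outliers helper is assumed to wrap something similar; the data below is synthetic):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
df = pd.DataFrame({
    # 950 "normal" values plus 50 extreme ones
    "amt": np.concatenate([rng.normal(100, 10, 950), rng.normal(100, 10, 50) * 50]),
    "cnt": rng.integers(0, 20, 1000).astype(float),
})

# Median-impute, then flag ~5% of rows as outliers across all features jointly
filled = df.fillna(df.median())
pred = IsolationForest(contamination=0.05, random_state=0).fit_predict(filled)
inliers = df[pred == 1]  # fit_predict returns +1 for inliers, -1 for outliers

# The green band on each chart spans [min, max] of the inlier rows, per column
band = pd.DataFrame({"mins": inliers.min(), "maxs": inliers.max()})
print(band)
```

Columns dominated by a handful of extreme rows (like amt here) end up with a band much narrower than the full data range, which is exactly what a narrow green band in the charts indicates.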

Exploring Manual Feature Engineering¶

This dataset is given to us already pretty clean (the time features are even normalized, using relative time for each individual loan).

Despite this, it still has a lot of potential for further Feature Engineering.

I wanted to learn from the best, so I scanned the winning team's Kaggle solution/discussion and found some really interesting FE efforts that I wanted to replicate for practice.

The first one, in particular, is quite mind-blowing. They use three of the strongest features (EXT_SOURCE_1/2/3) to build a 3-dimensional space/cube and apply KNN to learn about patterns inside it. They use Target Mean Encoding to understand, for any point in that space, how well its 500 closest neighbors perform at paying their loans without issues. (They actually use a 4th dimension, a tesseract instead of a cube, to also encode the credit vs annuity ratio, but visualizing a 3D cube is a lot easier than a 4D tesseract, so allow me this creative simplification during this explanation.)

In their own words:

neighbors_target_mean_500: The mean TARGET value of the 500 closest neighbors of each row, where each neighborhood was defined by the three external sources and the credit/annuity ratio

This gives them a very interesting variant of Target Mean Encoding, based on each row's closest neighbors, collapsing 4 continuous features into a single success-ratio encoding.

Fascinating!

Anyway, I wanted to try replicating this to practice my FE skills, but I also wanted to include 6 more versions of it, simplified siblings that use only subsets of the data.
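Before wiring this into the real pipeline, the core trick can be sketched on synthetic data (a minimal sketch assuming scikit-learn's NearestNeighbors; k is reduced from 500 for this toy example, and the relationship between the scores and the target is made up):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "ext_source_1": rng.uniform(0, 1, 2000),
    "ext_source_2": rng.uniform(0, 1, 2000),
})
# Synthetic target: defaults are more likely when the scores are low
df["target"] = (
    rng.uniform(0, 1, 2000) > df[["ext_source_1", "ext_source_2"]].mean(axis=1)
).astype(int)

k = 50
cols = ["ext_source_1", "ext_source_2"]
nn = NearestNeighbors(n_neighbors=k + 1).fit(df[cols])
_, idx = nn.kneighbors(df[cols])
idx = idx[:, 1:]  # drop each row's own index (it is its own nearest neighbor)

# Each row is encoded as the mean target of its k nearest neighbors
df["neighbors_target_mean"] = df["target"].to_numpy()[idx].mean(axis=1)
```

Each row now carries the empirical default rate of its neighborhood, which is the essence of the neighbors_target_mean_500 feature.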

Making derived features from our strongest predictors¶

These are the features I envisioned:

| feature         | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | credit/annuity ratio |
|-----------------|--------------|--------------|--------------|----------------------|
| knn_500_tme_1   | ✅           |              |              | ✅                   |
| knn_500_tme_2   |              | ✅           |              | ✅                   |
| knn_500_tme_3   |              |              | ✅           | ✅                   |
| knn_500_tme_12  | ✅           | ✅           |              | ✅                   |
| knn_500_tme_23  |              | ✅           | ✅           | ✅                   |
| knn_500_tme_13  | ✅           |              | ✅           | ✅                   |
| knn_500_tme_all | ✅           | ✅           | ✅           | ✅                   |

All the features marked with ✅ are used for that new calculation, with .fillna(0) applied where necessary.

The reason I wanted to create the 6 simplified versions is that some of the features have a high number of NAs (particularly EXT_SOURCE_3), and I wanted our model to still have "fallback" versions of this information when one or two of the fields are missing (think of it as flattening the "cube" onto one of its faces). I imagine the winning team also tried this but likely didn't get good enough performance, which is why they didn't mention it in their discussion. So I don't expect the 6 simplified versions to perform amazingly, but that's ok: this exercise is about practicing elegant ways to encode and engineer features.

I haven't decided whether my model will be given all 7 combinations, or only the subset of features we consider useful given the NAs in each row. E.g., if the row to predict only has features 1, 3, and credit/annuity, should it get all 7 features, or just the best 3 (in this case: knn_500_tme_13, knn_500_tme_1 and knn_500_tme_3)?

In [ ]:
df_fe_knns = eda_utils.balance(tables["application"].copy(), "target")
df_fe_knns["credit_annuity_ratio"] = df_fe_knns["amt_credit"] / df_fe_knns["amt_annuity"]
df_fe_knns = df_fe_knns[
    ["ext_source_1", "ext_source_2", "ext_source_3", "credit_annuity_ratio", "target"]
]
df_fe_knns_nas = df_fe_knns.copy()
df_fe_knns_nas[["ext_source_1"]] = df_fe_knns_nas[["ext_source_1"]].fillna(0)
df_fe_knns_nas[["ext_source_2"]] = df_fe_knns_nas[["ext_source_2"]].fillna(0)
df_fe_knns_nas[["ext_source_3"]] = df_fe_knns_nas[["ext_source_3"]].fillna(0)

feateng.target_mean_encoding_using_knn(
    df_fe_knns_nas,
    "target",
    ["ext_source_1", "ext_source_2", "ext_source_3", "credit_annuity_ratio"],
)

Let's visualize this cube from each of the sides:

  • In Blue: x=EXT_1 vs y=EXT_3
  • In Orange: x=EXT_1 vs y=EXT_2
  • In Magenta: x=EXT_2 vs y=EXT_3
In [ ]:
@run
@cached_chart()
def feateng_knn_visualize_target_spread():
    chart_utils.visualize_3d_cube(
        df_fe_knns_nas,
        dimX="ext_source_1",
        dimY="ext_source_2",
        dimZ="ext_source_3",
        target="target",
        title="Spread of TARGET over pairs of EXT_SOURCE_*",
        elevation=30,
        rotation=30,
    )
    return plt.gca()
In [ ]:
filename = "cached/charts/feateng_rotate_extsources.gif"
if not os.path.exists(filename):
    chart_utils.spinning_3d_plot(
        df_fe_knns_nas,
        dimX="ext_source_1",
        dimY="ext_source_2",
        dimZ="ext_source_3",
        target="target",
        title="Spread of TARGET over pairs of EXT_SOURCE_*",
        elevation=30,
        rotation=30,
        export_gif_path=filename,
    )
else:
    print("loading pre-render")

Let's check the file we created to get a better understanding.

There seems to be a pattern to the data.

  • Higher values in these dimensions tend to go with better outcomes (no difficulties paying).
  • There is a non-negligible number of rows with NAs. They get flattened onto a plane/line. Hopefully the partial derived features will help.

Let's try to apply KNN to do target mean encoding, and also visualize it.

Target Mean Encoding using KNN¶

This renders the actual spread of TARGET; let's now visualize the same thing, but colouring by the knn(500) mean instead of each row's own target.

Once again, we will use a subset of the features (3) so we can visualize it in space. For the actual encoding, we will use all 4 features.

In [ ]:
# Since this dataset is a lot smaller than the training one, we use a proportionally smaller number of neighbors (the ratio is 1/7)
knn500 = NearestNeighbors(n_neighbors=500 // 7, algorithm="ball_tree")
display(knn500)

extscores = ["ext_source_1", "ext_source_2", "ext_source_3"]
nbrs = knn500.fit(df_fe_knns_nas[extscores])

distances, indices = knn500.kneighbors(df_fe_knns_nas[extscores])
indices = indices[:, 1:]

df_fe_knns_nas["neighbors_target_mean_knn"] = [
    df_fe_knns_nas["target"].iloc[indices[i]].mean() for i in range(len(indices))
]
In [ ]:
df_fe_knns_nas
In [ ]:
@run
@cached_chart()
def feateng_knn_visualize_target_knn():
    chart_utils.visualize_3d_cube(
        df_fe_knns_nas,
        dimX="ext_source_1",
        dimY="ext_source_2",
        dimZ="ext_source_3",
        target="neighbors_target_mean_knn",
        title="Spread of target_mean over pairs of EXT_SOURCE_*",
        elevation=30,
        rotation=30,
        palette="PuOr",
    )
    return plt.gca()
In [ ]:
filename = "cached/charts/feateng_rotate_extsources_knn.gif"
if not os.path.exists(filename):
    chart_utils.spinning_3d_plot(
        df_fe_knns_nas,
        dimX="ext_source_1",
        dimY="ext_source_2",
        dimZ="ext_source_3",
        target="neighbors_target_mean_knn",
        title="Spread of target_mean over pairs of EXT_SOURCE_*",
        elevation=30,
        rotation=30,
        palette="PuOr",
        export_gif_path=filename,
    )
else:
    print("loading pre-render")

Let's check the file we created to get a better understanding.

The trend seems pretty obvious, but we're worried that the "flattened"/NA-filled values sitting on the zero plane will pollute the values that are really close to them.

Since the relationship between these 3 features and the target looks quite simple (higher is better), and it's the same for all three, we're considering a smarter imputation: use whichever of the 3 features are present to impute the missing ones. As long as at least 1 of the features is present, we should be able to improve/enrich our data.

In [ ]:
df_fe_knns = eda_utils.balance(tables["application"].copy(), "target")
df_fe_knns["credit_annuity_ratio"] = df_fe_knns["amt_credit"] / df_fe_knns["amt_annuity"]
df_fe_knns = df_fe_knns[
    ["ext_source_1", "ext_source_2", "ext_source_3", "credit_annuity_ratio", "target"]
]

extscores = ["ext_source_1", "ext_source_2", "ext_source_3"]
imputer = KNNImputer(n_neighbors=100)

df_fe_knns_imputed = df_fe_knns.copy()
df_fe_knns_imputed[extscores] = imputer.fit_transform(df_fe_knns[extscores])
df_fe_knns_imputed
In [ ]:
knn500 = NearestNeighbors(n_neighbors=500 // 7, algorithm="ball_tree")
display(knn500)

extscores = ["ext_source_1", "ext_source_2", "ext_source_3"]
nbrs = knn500.fit(df_fe_knns_imputed[extscores])

distances, indices = knn500.kneighbors(df_fe_knns_imputed[extscores])
indices = indices[:, 1:]

df_fe_knns_imputed["neighbors_target_mean_knn"] = [
    df_fe_knns_imputed["target"].iloc[indices[i]].mean() for i in range(len(indices))
]
In [ ]:
@run
@cached_chart()
def feateng_knn_visualize_target_knn_imputed():
    chart_utils.visualize_3d_cube(
        df_fe_knns_imputed,
        dimX="ext_source_1",
        dimY="ext_source_2",
        dimZ="ext_source_3",
        target="neighbors_target_mean_knn",
        title="Spread of target_mean over pairs of EXT_SOURCE_*",
        elevation=70,
        rotation=-30,
        palette="PuOr",
    )
    return plt.gca()
In [ ]:
filename = "cached/charts/feateng_rotate_extsources_knn_imputed.gif"
if not os.path.exists(filename):
    chart_utils.spinning_3d_plot(
        df_fe_knns_imputed,
        dimX="ext_source_1",
        dimY="ext_source_2",
        dimZ="ext_source_3",
        target="neighbors_target_mean_knn",
        export_gif_path=filename,
        title="Spread of target_mean over pairs of EXT_SOURCE_*",
        elevation=30,
        rotation=30,
        palette="PuOr",
    )
else:
    print("loading pre-render")

It's pretty clear that this space can easily be sliced in two using a single plane. Based on what we're seeing, we could consider using an SVC, just to see how it performs.

But let's not fool ourselves. Let's verify it:

In [ ]:
target_mean_0 = df_fe_knns_imputed[df_fe_knns_imputed["target"] == 0][
    "neighbors_target_mean_knn"
]
target_mean_1 = df_fe_knns_imputed[df_fe_knns_imputed["target"] == 1][
    "neighbors_target_mean_knn"
]
In [ ]:
@run
@cached_chart()
def eda_difference_between_target01_after_imputing():
    plt.title("Histogram of the spread of values after imputing using knn500 (scaled)")
    plot_zero = sns.histplot(target_mean_0, binwidth=0.03, label="target 0")
    plot_one = sns.histplot(target_mean_1, binwidth=0.03, label="target 1")
    plt.legend()
    return plt.gca()

And let's also apply a bit of statistical inferential analysis:

  • Two samples = the KNN-500 imputed means, for the cases with a target of 0 vs a target of 1
  • $H_0 = $ There isn't a statistically significant difference between these two groups
  • $H_1 = $ There is a statistically significant difference between these two groups
  • Significance Level = 0.05
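As a toy illustration of the decision rule (synthetic samples, assuming scipy.stats.ttest_ind, which is also what we run on the real groups):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Two synthetic samples whose true means differ slightly
group_a = rng.normal(0.08, 0.02, 5000)
group_b = rng.normal(0.10, 0.02, 5000)

res = stats.ttest_ind(group_a, group_b)
alpha = 0.05
reject_h0 = res.pvalue < alpha  # with this much data, the difference is detected
print(res.pvalue, reject_h0)
```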
In [ ]:
stats.ttest_ind(target_mean_0, target_mean_1)

Since the p-value is lower than our significance level, we reject the null hypothesis. There might be something to this imputation using KNN(500) (scaled).

However, remember that this is just the EDA/exploration for feature engineering, and that the actual Feature Engineering will use a 4D space, because it will also include the credit/annuity ratio. We don't expect that ratio to follow any patterns or correlate with any of the other 3 dimensions, but let's check, just to be extra sure:

In [ ]:
filename = "cached/charts/feateng_rotate_extsources_knn_imputed_ca_ratio.gif"
if not os.path.exists(filename):
    chart_utils.spinning_3d_plot(
        df_fe_knns_imputed,
        dimX="ext_source_1",
        dimY="ext_source_2",
        dimZ="ext_source_3",
        target="credit_annuity_ratio",
        export_gif_path=filename,
        title="Spread of credit_annuity_ratio over pairs of EXT_SOURCE_*",
        elevation=30,
        rotation=30,
        palette="PuOr",
    )
else:
    print("loading pre-render")

As expected, there are no discernible colour patterns when we colour by credit_annuity_ratio instead of target. This means the new feature encodes new information, and using it might be beneficial.

We expect that doing the same thing on the training dataset, using 4 dimensions instead of 3, will enrich it considerably.

For now, let's look at other ways to encode and include features:

Exploring Semi-Automated Feature Engineering¶

For this learning project, we would like to capture the information in all of the fact tables, so our classifier has additional signal to work with. We could do things manually, or we could learn how to use a cool library that seems fitting.

The featuretools library provides an easy way to perform Deep Feature Synthesis, which seems to work promisingly well for hierarchical datasets (like the one we have).

Instead of creating some of the features ourselves, we can use DFS to generate all of them and then assess how much each one contributes to our model's performance.

Before we get started, let's dump the supported types to spreadsheets so we are familiar with the transformations/aggregations it offers. Feel free to check them at your leisure if you are not familiar with DFS.

Since we plan on using dask later, the third column is also of particular interest.

In [ ]:
ft.list_primitives().head()
In [ ]:
if run_entire_notebook():
    ft.list_primitives().to_csv("feateng/docs/ftdfs_primitives_docs.csv")
    ft.list_logical_types().to_csv("feateng/docs/ftdfs_logical_types.csv")
    ft.list_semantic_tags().to_csv("feateng/docs/ftdfs_semantic_tags.csv")
    dask_primitives = ft.list_primitives()
    dask_primitives = dask_primitives[dask_primitives["dask_compatible"]]
    dask_primitives.to_csv("feateng/docs/ftdfs_primitives_docs__dask.csv")

Preparing our data for Deep Feature Synthesis (EDA Phase)¶

Now that we are more familiar with DFS datatypes, we will need to prepare our data.

This preparation will allow the library to understand our data like we do, and to optimize the transformations/aggregations in ways that make sense.

This requires:

  • [x] loading the data
  • [x] adding the dataframes into the entity set
  • [x] marking which columns can be used as indexes
  • [x] marking which columns refer to other tables (so ft can find relationships between them)
  • [x] ensuring that the types of columns are detected correctly or fixing them manually
    • This requires checking all 3 types of metadata:
| DataFrame    | Physical Types | Logical Types | Semantic Tags |
|--------------|----------------|---------------|---------------|
| app          | ✅             | ✅            | ✅            |
| prev         | ✅             | ✅            | ✅            |
| bureau       | ✅             | ✅            | ✅            |
| bureau_bal   | ✅             | ✅            | ✅            |
| credit_bal   | ✅             | ✅            | ✅            |
| pos_bal      | ✅             | ✅            | ✅            |
| installments | ✅             | ✅            | ✅            |

All of these tasks will be repeated in the "production" steps. We will export/store all of the lessons we learn so we can re-apply them identically to the other datasets as well (train/val/test).

In the cells below, you can see all of these tasks being done:

In [60]:
def set_primary_key(df, col: str):
    df.attrs["pk_col_name"] = col
    return df


def add_compound_key(df, col1: str, col2: str):
    df["pk"] = df[col1].astype(str) + "__" + df[col2].astype(str)
    df.attrs["pk_col_name"] = "pk"
    return df


def autonumeric_pk(df):
    df = df.reset_index(names="pk")
    df.attrs["pk_col_name"] = "pk"
    return df

Loading the data:

In [61]:
@cached_dataframes()
def eda_dfs_all_entities():
    h_app = read_eda("application_hyper")
    h_bureau = read_eda("bureau")
    h_bureau_balance = read_eda("bureau_balance")
    h_credit_balance = read_eda("credit_card_balance")
    h_installments_payments = read_eda("installments_payments")
    h_pos_balance = read_eda("pos_cash_balance")
    h_prev = read_eda("previous_application")

    h_app = set_primary_key(h_app, "sk_id_curr")
    h_prev = set_primary_key(h_prev, "sk_id_prev")
    h_bureau = set_primary_key(h_bureau, "sk_id_bureau")
    h_bureau_balance = add_compound_key(
        h_bureau_balance, "sk_id_bureau", "months_balance"
    )
    h_credit_balance = add_compound_key(h_credit_balance, "sk_id_prev", "months_balance")
    h_pos_balance = add_compound_key(h_pos_balance, "sk_id_prev", "months_balance")

    # no column (or compound of columns) is de-facto unique here, so we create our own PK
    h_installments_payments = autonumeric_pk(h_installments_payments)

    return {
        "application": h_app,
        "bureau": h_bureau,
        "bureau_balance": h_bureau_balance,
        "credit_card_balance": h_credit_balance,
        "installments_payments": h_installments_payments,
        "pos_cash_balance": h_pos_balance,
        "previous_application": h_prev,
    }

Creating an empty EntitySet that will contain the entire graph:

In [62]:
(es := ft.EntitySet("applications"))
Out[62]:
Entityset: applications
  DataFrames:
  Relationships:
    No relationships

Adding the dataframes into the EntitySet, along with their relationships

In [63]:
entities = eda_dfs_all_entities()

for df_name, df in entities.items():
    index_col = df.attrs["pk_col_name"]
    print("Adding", df_name, "using index =", index_col)
    ftdfs_utils.attach_df_into_es(es, df, df_name, index_col, "feateng")

del entities
Loading from cache [./cached/df_dict/eda_dfs_all_entities.h5]
Adding application using index = sk_id_curr
Adding bureau using index = sk_id_bureau
Adding bureau_balance using index = pk
Adding credit_card_balance using index = pk
Adding installments_payments using index = pk
Adding pos_cash_balance using index = pk
Adding previous_application using index = sk_id_prev
In [64]:
def rel(t1, c1, t2, c2):
    return ft.Relationship(es, t1, c1, t2, c2)


relationships = [
    rel("application", "sk_id_curr", "bureau", "sk_id_curr"),
    rel("bureau", "sk_id_bureau", "bureau_balance", "sk_id_bureau"),
    rel("application", "sk_id_curr", "previous_application", "sk_id_curr"),
    rel("previous_application", "sk_id_prev", "pos_cash_balance", "sk_id_prev"),
    rel("previous_application", "sk_id_prev", "installments_payments", "sk_id_prev"),
    rel("previous_application", "sk_id_prev", "credit_card_balance", "sk_id_prev"),
    rel("application", "sk_id_curr", "pos_cash_balance", "sk_id_curr"),
    rel("application", "sk_id_curr", "installments_payments", "sk_id_curr"),
    rel("application", "sk_id_curr", "credit_card_balance", "sk_id_curr"),
]

es.add_relationships(relationships)
Out[64]:
Entityset: applications
  DataFrames:
    application [Rows: 30751, Columns: 122]
    bureau [Rows: 146263, Columns: 17]
    bureau_balance [Rows: 1474814, Columns: 4]
    credit_card_balance [Rows: 322932, Columns: 24]
    installments_payments [Rows: 1159358, Columns: 9]
    pos_cash_balance [Rows: 852590, Columns: 9]
    previous_application [Rows: 141265, Columns: 37]
  Relationships:
    bureau.sk_id_curr -> application.sk_id_curr
    bureau_balance.sk_id_bureau -> bureau.sk_id_bureau
    previous_application.sk_id_curr -> application.sk_id_curr
    pos_cash_balance.sk_id_prev -> previous_application.sk_id_prev
    installments_payments.sk_id_prev -> previous_application.sk_id_prev
    credit_card_balance.sk_id_prev -> previous_application.sk_id_prev
    pos_cash_balance.sk_id_curr -> application.sk_id_curr
    installments_payments.sk_id_curr -> application.sk_id_curr
    credit_card_balance.sk_id_curr -> application.sk_id_curr

Manually tweaking the data types¶

Let's persist to disk the auto-detected info (logical/semantic) and check it manually.

In [65]:
if run_entire_notebook():
    for name, df in es.dataframe_dict.items():
        f_logi = f"feateng/{name}____logical_types.json"
        f_semantic = f"feateng/{name}____semantic_tags.json"
        ftdfs_utils.save_logical_types(es, name, f_logi)
        ftdfs_utils.save_semantic_tags(es, name, f_semantic)
skipping optional operation

After a quick inspection of the files, we can see that there are a few bits of metadata that could be improved.

I'll document a couple of examples, out of the many found, so you get an idea of the types of mismatches:

  • A lot of the fields that are auto-detected as Integer could actually be marked as Boolean, potentially making things a lot easier for DFS (e.g., no need to encode them). We modify the config files on disk manually to reflect these changes. All of the files have been reviewed and tweaked to encode these minor details, which we expect to pay off at computation time (better performance, lower computational cost, or a smaller memory footprint).
  • The proper way to fix this would be to change the data itself so it's stored as True/False instead of 0/1. However, for some of the aggregations we want to see how these fields distribute, so keeping them numeric preserves information that would be lost as Booleans. E.g., Booleans support few aggregate functions (MODE, etc.), while numerics support many more (MIN, MAX, AVG), letting us compare AVGs across customers from different clusters, which would not be possible with Booleans.
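As an illustration, here is a hypothetical excerpt of one of these logical-type files (the exact on-disk format of the ftdfs_utils dumps is an assumption; the column names come from the Home Credit application table, and the flag_* fields are 0/1 integers re-tagged as Boolean):

```json
{
  "sk_id_curr": "Integer",
  "amt_credit": "Double",
  "flag_mobil": "Boolean",
  "flag_emp_phone": "Boolean"
}
```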

Let's keep things simple (for now)¶

Some of the detected types have been tweaked manually (check the files under /feateng).

Some of these auto-detected types could be improved further (eg: we could mark some of these as Timedelta instead of Integer to indicate that these numbers are not quantities but durations in time).

However, we will not go to such lengths right now. Let's first see how things perform with basic tweaking, and come back if performance needs to improve further.

Scaling this beyond small datasets¶

This is all nice and good, but so far we are just thinking about EDA.

Before continuing, let's make sure that DFS will work with the use cases that we have in mind.

We want to be familiar enough with the trade-offs to discard this tool, should we find a major dealbreaker.

For this, we need to spend a bit of time reviewing docs, and also assessing these tradeoffs with our intentions/data.

Below we're documenting some of the better-known issues/trade-offs found while exploring the documentation, along with our thoughts on how they might impact this tool's viability for the rest of the project.

Exploring Deep Feature Synthesis on Dask!¶

We need to remember that this laptop is kind of old, and that the training dataset is 7x larger than the current one... and some of those JOINs will likely grow our memory footprint by more than a factor of 7! So this project looked like a good opportunity to learn about and practice with Dask.

  • 🟩 Good news: Featuretools supports Dask for Deep Feature Synthesis!
  • 🟥 Bad News: Some things are not supported, and some others are still in beta.

Limited support for Dask¶

While featuretools supports Dask for Deep Feature Synthesis, it does not support all of the operations/transformations.

The official documentation reads (as of Oct-2023)

Supported Primitives When creating a feature matrix from a Dask EntitySet, only certain primitives can be used. Primitives that rely on the order of the entire DataFrame or require an entire column for computation are currently not supported when using a Dask EntitySet. Multivariable and time-dependent aggregation primitives also are not currently supported.

(It actually makes sense!)

This greatly reduces the number of transformations available to us, but since this is a learning project, not a production one, we're happy to trade off some predictive performance in our model in exchange for learning this interesting technology, which seems to fit our problem really well!

These missing features don't seem like a dealbreaker (so far!)

Let's continue!

Due diligence when working with DFS on Dask¶

DFS seems promising, but since it's the first time we try it, we want to make sure we're familiar with some caveats and limitations that we will encounter when we use it with dask.

We've documented our findings and how some of them might impact us.

Regarding UNIQUENESS in index columns¶

Documentation reads:

By default, Woodwork checks that pandas DataFrames have unique index values. Because performing this same check with Dask would require an expensive compute operation, this check is not performed when adding a Dask DataFrame to an EntitySet. When using Dask DataFrames, users must ensure that the supplied index values are unique.

This was actually already verified during our first passes/scans over the data, to guarantee that the PKs were unique, e.g.:

select SK_ID_BUREAU, count(*)
from BUREAU
group by SK_ID_BUREAU
having count(*) > 1;

0 rows returned


This is also the reason that we created specific custom PK columns for the 3 tables that did not have guaranteed UNIQUE constraints on the IDs (check the use of add_compound_key() above for details)
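For reference, the same uniqueness check can be done in pandas before handing a frame to Dask (toy frame below; the column name follows the real schema, the data is made up):

```python
import pandas as pd

bureau = pd.DataFrame({"sk_id_bureau": [100, 101, 102, 103]})

# Equivalent of: GROUP BY sk_id_bureau HAVING count(*) > 1
dupes = bureau.groupby("sk_id_bureau").size()
dupes = dupes[dupes > 1]
assert dupes.empty  # 0 rows returned

# Shortcut for the same check
assert bureau["sk_id_bureau"].is_unique
```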

Regarding inference of Logical Types¶

When adding a DataFrame to an EntitySet, Woodwork will attempt to infer the logical types for any columns that do not have a logical type defined. This inference process can be quite expensive for Dask DataFrames

This will be handled manually to ensure that all the types are set correctly and cheaply. Check the JSON files in the repo for a dump of all the logical-type and semantic-tag configurations. Remember: these are the files we tweaked manually earlier!

Regarding mixing types¶

When creating a Featuretools EntitySet that will be made of Dask DataFrames, all of the DataFrames used to create the EntitySet must be of the same type, either all Dask DataFrames or all pandas DataFrames. Featuretools does not support creating an EntitySet containing a mix of Dask and pandas DataFrames.

We don't plan on mixing dask and pandas.

Regarding inference of interesting values¶

Additionally, EntitySet.add_interesting_values() cannot be used in Dask EntitySets to find interesting values; however, it can be used set a column’s interesting values with the values parameter

We plan on using the library to infer the Interesting Values on Pandas (during EDA), and then exporting those insights for the full run on Dask! (Smart? Insane? Let's find out!) 😉 Wish us luck!

Regarding using cutoff times for time analysis¶

There are a few key limitations when generating a feature matrix from a Dask EntitySet. If a cutoff_time parameter is passed to featuretools.dfs() it should be a single cutoff time value, or a pandas DataFrame.

Additionally, Featuretools does not currently support the use of the approximate or training_window parameters when working with Dask EntitySets

We do not plan on using this for time analysis (plus, this dataset is already really clean! All the time-related preprocessing work has already been done for us!)

Regarding NANs on the target variable¶

Finally, if the output feature matrix contains a boolean column with NaN values included, the column type may have a different datatype than the same feature matrix generated from a pandas EntitySet

Not our situation

Regarding the memory footprint¶

In some instances, generating a feature matrix with a large number of features has resulted in memory issues on Dask workers. The underlying reason for this is that the partition size of the feature matrix grows too large for Dask to handle as the number of feature columns grows large. This issue is most prevalent when the feature matrix contains a large number of columns compared to the DataFrames in the EntitySet. Possible solutions to this problem include reducing the partition size used when creating the DataFrames or increasing the memory available on Dask workers.

Oh! This is going to be a BIG problem! Let's watch out for this one, very closely!

Regarding Automatic Feature Encoding¶

Currently featuretools.encode_features() does not work with a Dask DataFrame as input. This will hopefully be resolved in a future release of Featuretools.

We had no idea this was a possibility. So we lost nothing... but we're still sad.

Other Utilities¶

The use of featuretools.remove_low_information_features() cannot currently be used with a Dask feature matrix.

Not sure we will need it, but we will try.

If required, we might try the pandas-sample-first, dask-for-prod-later approach.

On creating custom DFS "Feature" (a DFS aggregator/transformer)¶

When manually defining a Feature, the use_previous parameter cannot be used if this feature will be applied to calculate a feature matrix from a Dask EntitySet.

This seems to refer to creating custom Features. If we create some, we won't be allowed to specify timeframes. (This is just another consequence of the limitations listed above: on Dask, total ordering and availability of all data is not guaranteed, and that also constrains how Features can be created.)

This will not affect us:

  • We could not use existing Features that rely on time frames anyway.
  • We now also know we cannot create new Features that use time frames (which is what this disclaimer adds).

Conclusion¶

That's it! These are all of the trade-offs we found when using DFS on Dask. While things are not perfect, they are pretty good: the parts we lose are minor. The only big problem is the memory footprint, but that paragraph is really a reminder that applies whenever we work with large amounts of data; it is not something we would avoid by using pandas instead of Dask.

So, all minor issues, and the major one would still be there even if we skipped Dask. In fact, we are considering Dask precisely because of that issue, so nothing is lost.

With this in mind, I think we can continue with this plan for now.

Reference - Dask Tutorial 2h

https://www.youtube.com/watch?v=Tl8rO-baKuY

Exploring Deep Feature Synthesis - First manual pass¶

Let's do a first pass with DFS on real data and see what the output looks like.

The ultimate goal is to calculate feature importance on the features (old and newly generated) and see if they actually contribute to the predictions.

In [66]:
# using 1 got us 520 feats
# using 2 we got +1500 feats
# using 3 we got +5800... let's dial it down
dfs_depth = 1

prod_settings.put("dfs_depth", dfs_depth)

feats_imbalanced, feat_definitions = ft.dfs(
    entityset=es,
    target_dataframe_name="application",
    agg_primitives=["mean", "sum", "min", "max", "count", "percent_true"],
    trans_primitives=None,
    max_depth=dfs_depth,
    features_only=False,
)
In [67]:
feats_balanced = eda_utils.balance(feats_imbalanced, "target")
In [68]:
if run_entire_notebook("sweetviz_report_after_dfs"):
    sv_report = sv.analyze(feats_balanced, target_feat="target", pairwise_analysis="off")
    sv_report.show_html("cached/sweetviz_report_after_dfs.html")

display(IFrame("cached/sweetviz_report_after_dfs.html", width=1600, height=600))
skipping optional operation

Some interesting insights from these features (some of the lessons learned are from previous runs with depth=2 or =3):

  • features derived from previous_application.credit_card_balance.* have a high number of NAs (> 70%)
    • we will likely have to skip most of those combinations
  • the majority of features that involve MAX() or MIN(), and the features derived from them, have extremely skewed distributions, which makes them not very useful.
    • (chart screenshot, not rendered in this export)
    • The trend of the line no longer matters, seeing how all of the datapoints are concentrated in the first column
    • We will likely skip MIN and MAX features in future runs
  • Some of the metrics generated do look good:
    • all buckets have a similar number of datapoints
    • the line that tracks mean TARGET does change clearly along the spectrum
    • have a low % of NAs
    • (chart screenshot, not rendered in this export)
    • Really good! But they are rather rare, and they require the use of MIN/MAX functions...
  • Some of the combinations don't make sense. eg: SUM(MIN()), which shows 1 single col (predictably)
    • (chart screenshot, not rendered in this export)
  • Some, very interestingly, are 100% NAs:
    • (chart screenshot, not rendered in this export)

It's pretty clear that we will be unable to manually assess this many columns. We want to semi-automatically discard the columns that do not contribute to predicting the target column.
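As a first semi-automatic filter, before any model-based importance, one option is to drop columns that are mostly NA or constant. A minimal sketch (the threshold and the toy frame are our own assumptions, not project code):

```python
import numpy as np
import pandas as pd

def drop_uninformative(df: pd.DataFrame, na_threshold: float = 0.7) -> pd.DataFrame:
    """Drop columns whose NA ratio exceeds the threshold, then drop
    columns with a single unique value (no predictive signal)."""
    keep = df.columns[df.isna().mean() <= na_threshold]
    out = df[keep]
    return out.loc[:, out.nunique(dropna=True) > 1]

# hypothetical toy frame
df = pd.DataFrame({
    "mostly_na": [np.nan] * 8 + [1.0, 2.0],  # 80% NA -> dropped
    "constant": [5] * 10,                     # zero variance -> dropped
    "useful": range(10),
})
print(list(drop_uninformative(df).columns))  # ['useful']
```

The 70% cutoff mirrors the NA rate we observed on the `previous_application.credit_card_balance.*` features; it is a judgment call, not a universal rule.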

Exploring Automating Feature Selection¶

From here on, we use the balanced subset, to make sure we get a fairer assessment.
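`eda_utils.balance` is a project-specific helper; conceptually it downsamples every class to the size of the smallest one. A minimal sketch of the idea (toy data is ours):

```python
import pandas as pd

def balance(df: pd.DataFrame, target_col: str, seed: int = 42) -> pd.DataFrame:
    """Downsample every class to the size of the smallest class."""
    n_min = df[target_col].value_counts().min()
    parts = [g.sample(n=n_min, random_state=seed) for _, g in df.groupby(target_col)]
    # shuffle so classes are interleaved rather than stacked
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# hypothetical 90/10 imbalanced toy frame
df = pd.DataFrame({"target": [0] * 90 + [1] * 10, "x": range(100)})
balanced = balance(df, "target")
print(len(balanced))  # 20 rows, 10 per class
```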

In [69]:
num_feats = feats_balanced.select_dtypes(exclude="category")
cat_feats = feats_balanced.select_dtypes(include="category")

hyperparams = {"subsample": 0.5, "colsample_bytree": 0.5}

lgbm_importance_df = featsel.feature_selection_lgbm(
    feats_balanced,
    "target",
    list(set(num_feats.columns) - {"target"}),
    list(set(cat_feats.columns)),
    objective="binary",
    hyperparams=hyperparams,
)
lgbm_importance_df.head(10)
Out[69]:
feature importance
0 ext_source_2 78
1 ext_source_3 75
2 ext_source_1 65
3 days_birth 62
4 organization_type 49
5 MEAN(bureau.days_credit_enddate) 40
6 MAX(bureau.days_credit_enddate) 35
7 days_employed 34
8 days_registration 33
9 MEAN(previous_application.cnt_payment) 31

Let's try to visualize how the feature importance spreads across all of our features:

(We don't want to analyze them individually; we just want to see the overall shape/spread.) As it turns out, the default importance plot cannot be customized for this either, so we plot it ourselves.

Note that in this case we will use LGBM simply because it's a bit faster than XGBoost, while both achieve similar performance (we compared them).

In [ ]:
prod_settings.put("dfs_most_important_feats", list(lgbm_importance_df.head(100)["feature"]))
In [73]:
@run
@cached_chart()
def eda_feat_importance():
    plt.figure(figsize=(9, 20))
    return sns.barplot(
        data=lgbm_importance_df.head(100),
        y="feature",
        x="importance",
        color=moonstone,
    )
<AxesSubplot: xlabel='importance', ylabel='feature'>

This seems reasonable:

  • Few features with high scores
  • A lot of features that don't contribute much, towards the bottom (not shown in the graph)
  • And an overall log-like curve when sorting them in descending order
  • The top rated features also appear repeatedly in most other models (xgboost/lgbm/...)

This looks really good. Here's a few things we like seeing:

  • We see strong similarities with the features that we visually identified in the first pass, during our visual inspection of the raw dataset (without Deep Feature Synthesis):
    • Some examples are EXT_SOURCE_*, DAYS_BIRTH, etc..
    • go see our summary after the Visualizing Data section
  • Some of the generated features are on par with some of the original features, indicating that there are some interesting patterns and trends hiding outside of the primary table.
  • Our concerns that "maybe using dask would limit which features we could create", while true, might not be as big of a dealbreaker: we're getting some interesting results even with basic calculations.
  • The vast majority of features have a low score (as expected, since this was created by brute-forcing all combinations, with little logic or rationale), which means we can easily remove a lot of them and significantly reduce the problems that would come from the Cuuuuurse of Dimensionality!

Here are some things we'd still like to understand:

  • Where are the FLAG_DOCUMENT_xx fields? They seemed important in our initial inspection! Were we wrong?
  • We tried switching them from numeric to categorical, but the scores are very similar, so it doesn't seem to be a problem of the model misinterpreting them.
In [89]:
lgbm_importance_df.loc[lgbm_importance_df["feature"].str.contains("docu")].head(10)
Out[89]:
feature importance
275 flag_document_3 2
327 flag_document_6 1
345 flag_document_18 1
367 flag_document_17 0
378 flag_document_4 0
380 flag_document_14 0
381 flag_document_7 0
385 flag_document_13 0
391 flag_document_2 0
395 flag_document_5 0

After checking visually a second time and re-reading our original text, it seems that I was misremembering.

The original text read:

The documents (FLAG_DOCUMENT_xx). When documents are provided, the risk decreases noticeably.

Which was true. In my head, I had accidentally turned that into "the FLAG_DOCUMENT_xx fields help predict the target"; but due to the high imbalance of the data and how few cases there are of people providing the documents, they don't help the overall case.

Basically, if you can provide the documents, you are likely to be better off... but this is so rare that using these fields for prediction will not be useful, and we have to rely on other fields that contribute more data!
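The effect can be illustrated with hypothetical numbers (the rates below are made up for illustration, not taken from the dataset): even when providing a document clearly lowers the default rate, a flag carried by only ~3% of rows gives the model very few opportunities to split on it.

```python
import pandas as pd

# hypothetical toy data: only 3% of applicants provide the document
flag = [1] * 30 + [0] * 970
# defaults: 1 of 30 among providers, 78 of 970 among non-providers
target = [1] * 1 + [0] * 29 + [1] * 78 + [0] * 892
df = pd.DataFrame({"flag_document": flag, "target": target})

rates = df.groupby("flag_document")["target"].mean()
# lower risk when the document is provided, but the signal covers few rows
print(rates[1] < rates[0])  # True
```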

In summary: this feature importance analysis has been very useful to avoid fooling ourselves! It is helping us spot which fields we will keep in the next steps; and in the rare cases where we saw things that didn't make sense, we were able to go back and find out that lgbm was much better at it than we were (manually)!

Let's do one last check just to be EXTRA sure. Let's compare the distribution of NAs for these generated features:


Let's also compare how the NAs are spread in this subsampled-balanced dataset of the most useful 100 features.

In [90]:
top_100_feats_by_importance = feats_balanced[
    ["target"] + list(lgbm_importance_df["feature"].head(100).values)
]
In [91]:
@run
@cached_chart()
def dfs_nas_after_feat_selection_split():
    return msno.matrix(
        top_100_feats_by_importance, sort="descending", color=moonstone_rgb_n
    )
Loading from cache [./cached/charts/dfs_nas_after_feat_selection_split.png]

Not bad. Let's also compare them across the different values of TARGET, in case there were obvious patterns:

In [92]:
@run
@cached_chart()
def dfs_nas_after_feat_selection_split_by_target():
    return plot_missingnos_comparison(
        feats_balanced[["target"] + list(lgbm_importance_df["feature"].head(100).values)],
        "target",
        color=moonstone_rgb_n,
        sort_first=True,
    )
Loading from cache [./cached/charts/dfs_nas_after_feat_selection_split_by_target.png]

Nothing unexpected.

Let's also identify how the NAs stack up in "original columns" vs "newly generated columns". We will mark the DFS ones:

In [93]:
@run
@cached_chart()
def eda_plot_nas_after_dfs():
    derived_col = top_100_feats_by_importance.columns[
        top_100_feats_by_importance.columns.str.contains(r"\(")
    ]
    marks = {"grey": derived_col}
    ax = eda_utils.plot_nas_bootstrapped(
        top_100_feats_by_importance, marked_columns=marks
    )
    return ax
Loading from cache [./cached/charts/eda_plot_nas_after_dfs.png]

Storing DFS config¶

Let's store the configuration of "top features to generate" so that, when we train, we only generate those ones, in order to save some valuable computing time and memory:

In [114]:
def save_dfs_features(filename: str):
    pick_best_cols = 100
    top_columns = list(lgbm_importance_df.head(pick_best_cols)["feature"])

    generated_feats_names_all = [
        feature.get_name()
        for feature in feat_definitions
        if ("IdentityFeature" not in str(type(feature)))
    ]

    generated_feats_names_good = [
        feat for feat in generated_feats_names_all if feat in top_columns
    ]

    selected_features = [
        feature
        for feature in feat_definitions
        if feature.get_name() in generated_feats_names_good
    ]
    ft.save_features(selected_features, location=filename)


if not os.path.exists(ProdSettings.PATH_DFS_FEATURE_DEFINITIONS):
    save_dfs_features(ProdSettings.PATH_DFS_FEATURE_DEFINITIONS)
    print("feature definitions saved")
else:
    print("file", ProdSettings.PATH_DFS_FEATURE_DEFINITIONS, "already exists")
file prod/dfs_feature_definitions.json already exists

Exploring spread of NAs between different target groups¶

In [94]:
print(feats_balanced[feats_balanced["target"] == 0].shape)
print(feats_balanced[feats_balanced["target"] == 1].shape)
(2482, 521)
(2482, 521)
In [95]:
na_comparison = pd.DataFrame(
    {
        "target_0": feats_balanced[feats_balanced["target"] == 0].isna().sum()[0:100],
        "target_1": feats_balanced[feats_balanced["target"] == 1].isna().sum()[0:100],
    }
)
na_comparison = na_comparison[
    (na_comparison["target_0"] > 0) & (na_comparison["target_1"] > 0)
]
na_comparison["ratio"] = na_comparison["target_0"] / na_comparison["target_1"]
display(na_comparison.sort_values(by="ratio", ascending=False).head(10))
display(na_comparison.sort_values(by="ratio", ascending=False).tail(10))
target_0 target_1 ratio
def_60_cnt_social_circle 12 4 3.000000
obs_30_cnt_social_circle 12 4 3.000000
obs_60_cnt_social_circle 12 4 3.000000
def_30_cnt_social_circle 12 4 3.000000
occupation_type 814 614 1.325733
amt_goods_price 5 4 1.250000
ext_source_2 3 3 1.000000
own_car_age 1660 1719 0.965678
ext_source_1 1370 1460 0.938356
nonlivingapartments_mode 1680 1852 0.907127
target_0 target_1 ratio
floorsmax_medi 1161 1430 0.811888
floorsmax_mode 1161 1430 0.811888
floorsmax_avg 1161 1430 0.811888
years_beginexpluatation_medi 1135 1404 0.808405
years_beginexpluatation_mode 1135 1404 0.808405
years_beginexpluatation_avg 1135 1404 0.808405
emergencystate_mode 1097 1373 0.798980
totalarea_mode 1116 1400 0.797143
ext_source_3 452 572 0.790210
name_type_suite 9 13 0.692308
In [96]:
@run
@cached_chart()
def eda_dfs_ratio_nas_after_dfs():
    plt.xlim(0, 3)
    plt.title("Ratio of NAs between target=0 and target=1 (close to 1 is best)")
    return sns.histplot(na_comparison["ratio"], color=moonstone)
Loading from cache [./cached/charts/eda_dfs_ratio_nas_after_dfs.png]

It seems that for the vast majority of features, on a dataset that is artificially balanced, the % of NAs is the same, regardless of the value of the TARGET label.

In the rare situations where the ratio is far from 1.0, it's mostly due to the low number of cases.

Based on the data we have, we expect our model not to gain additional insights from the creation of new columns indicating "imputed but originally null" fields, so we have decided to skip this step in our data preprocessing pipeline.

Measuring the feature importance of the newly created features¶

The last thing we want to do before closing off the EDA part is answering this question about the manual feature engineering we did:

  • Are the manually created features also contributing to the model?
In [595]:
def enrich_knn(impute_neighbors: int, target_mean_neighbors: int) -> Pipeline:
    p_enrich = Pipeline(
        [
            (
                "drop_nas_req_annuity",
                clean.DropRowsWithNas(in_cols=["amt_credit", "amt_annuity"]),
            ),
            ("add_credit_annuity_ratio", credit_clean.AddCreditAnnuity()),
            (
                "impute_external_scores_knn",
                credit_clean.KnnImputeCols(
                    ["ext_source_1", "ext_source_2", "ext_source_3"],
                    n_neighbors=impute_neighbors,
                ),
            ),
            (
                "add_target_mean_extsources",
                credit_clean.TargetMeanEncodeUsing(
                    cols=[
                        "ext_source_1",
                        "ext_source_2",
                        "ext_source_3",
                        "credit_annuity_ratio",
                    ],
                    n_neighbors=target_mean_neighbors,
                ),
            ),
        ]
    )
    return p_enrich

Let's try to perform target encoding using these 4 continuous features, based on each row's $k$ nearest neighbors.

For now, we are using a few hundred neighbors. These are arbitrary values chosen at random; we just want to see if this helps create a feature that enriches our dataset in a meaningful way. If this works and we wanted to boost our performance, we could consider tuning these parameters with Optuna to find their sweet spots.
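`TargetMeanEncodeUsing` is a project transformer; here is a minimal numpy sketch of the underlying idea (brute-force neighbor search, leaving each row out of its own average; the function name and toy data are ours):

```python
import numpy as np

def knn_target_mean(X: np.ndarray, y: np.ndarray, n_neighbors: int) -> np.ndarray:
    """For each row, average the target over its k nearest neighbors
    in Euclidean feature space, excluding the row itself."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)  # a row must not vote for itself
    nn = np.argsort(d2, axis=1)[:, :n_neighbors]
    return y[nn].mean(axis=1)

rng = np.random.default_rng(0)
X = rng.random((200, 3))                  # stand-in for the 3 ext_source_* columns
y = (X.sum(axis=1) > 1.5).astype(float)   # toy binary target correlated with X
enc = knn_target_mean(X, y, n_neighbors=20)
print(enc.shape)  # (200,)
```

Excluding the row itself is the important detail: otherwise each row's own label leaks into its encoding and the feature looks far more predictive than it is.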

In [596]:
f = feats_balanced.copy()
f["target"] = f["target"].astype("int")

es_enriched = enrich_knn(100, 200).fit_transform(f.drop(columns="target"), f["target"])

Let's visualize how well (or not) our model did. We're trying to get an idea of how it performed (smaller errors are better):

In [690]:
diff = es_enriched.join(f["target"])
diff["err"] = np.abs(diff["target"] - diff["mean_target"])
In [752]:
@run
@cached_chart()
def feat_eng_enriched_meantarget_after_extsources():
    return chart_utils.visualize_3d_cube(
        diff,
        dimX="ext_source_1",
        dimY="ext_source_2",
        dimZ="ext_source_3",
        target="mean_target",
        title="Spread of mean_target",
        elevation=30,
        rotation=30,
        palette="PuOr",
    )
Loading from cache [./cached/charts/feat_eng_enriched_meantarget_after_extsources.png]

The results are quite similar to our prior findings, except the range is less broad.

Let's also visualize the error for each of those points (what the real value is, vs what we predicted based on their neighbors):

In [748]:
@run
@cached_chart()
def feat_eng_enriched_err_after_extsources():
    return chart_utils.visualize_3d_cube(
        diff,
        dimX="ext_source_1",
        dimY="ext_source_2",
        dimZ="ext_source_3",
        target="err",
        title="Spread of ERR",
        elevation=30,
        rotation=30,
        palette="PuOr",
    )
Loading from cache [./cached/charts/feat_eng_enriched_err_after_extsources.png]

It seems that the errors are below 0.5 (which is promising). Let's drill down a bit further:

In [693]:
plt.title("error (target - mean target of neighbors)")
sns.histplot(diff, x="err", color=moonstone, binwidth=0.03, multiple="stack")
plt.axvline(x=0.5)
Out[693]:
<matplotlib.lines.Line2D at 0x7fe1532a2970>

It does seem that most of the values fall below 0.5, indicating that there is some information encoded in this field (better than random chance).

Let's check that. What? Are you saying we could use a t-test?

Alright, let's give it a try:

  • Two samples = the KNN-imputed target mean for the cases with target 0 vs the cases with target 1
  • $H_0 = $ There isn't a statistically significant difference between these two groups
  • $H_1 = $ There is a statistically significant difference between these two groups
  • Significance Level = 0.05
In [694]:
stats.ttest_ind(target_mean_0, target_mean_1)
Out[694]:
Ttest_indResult(statistic=-27.915856712576, pvalue=2.0329687910962912e-159)

Since the p-value is lower than our significance level, we must reject the null hypothesis.

But, there's a plot twist!

Unfortunately, we cannot use the standard t-test because this distribution does not meet our requirements for normality:

In [706]:
e = diff["err"]
stats.kstest((e - e.mean()) / e.std(), "norm")
Out[706]:
KstestResult(statistic=0.06461382859316356, pvalue=1.8422604526814796e-18)
  • $H_0 = $ Distribution is gaussian/normal
  • $H_1 = $ Distribution is not gaussian/normal

p-value < 0.05, so we must reject $H_0$: based on the Kolmogorov-Smirnov test, this distribution is not normal.

Since the Kolmogorov-Smirnov test suggests that the distribution is not normal, we cannot trust the result of the first t-test (above).

Let's try to inspect it visually:

It also seems quite shifted, if we compare the two halves with a cutoff point around 0.5:

In [707]:
sns.histplot(diff, x="err", color=moonstone, binrange=[0, 1], bins=2, multiple="stack")
plt.axvline(x=0.5)
Out[707]:
<matplotlib.lines.Line2D at 0x7fe14a915100>
In [708]:
half = np.sign(e - 0.5)
half.value_counts()
Out[708]:
-1.0    3162
 1.0    1802
Name: err, dtype: int64

We could use the Wilcoxon signed-rank test, which does not require normality, but the data is not symmetric... so we will have to discard this one too:

  • Wilcoxon signed-rank test
  • $H_0 = $ The distribution does not have a median statistically significantly different from 0.5
  • $H_1 = $ The distribution has a median statistically significantly different from 0.5
In [709]:
stats.wilcoxon(e - 0.5)
Out[709]:
WilcoxonResult(statistic=3902853.5, pvalue=7.789745380488737e-111)

Again, despite the minuscule p-value, we must ignore it, as we do not meet the required condition for Wilcoxon: symmetry!

So far, none of the tests have been able to give us an answer that satisfies us.

Let's leave statistical inference and try to calculate whether we should include this metric or not through an empirical test: Does it provide information that helps us predict our actual target?

We can use Mutual Information to achieve this:
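A minimal sketch of that empirical check, using a binned plug-in estimator (`sklearn.feature_selection.mutual_info_classif` would be the off-the-shelf alternative; the toy data below is ours):

```python
import numpy as np

def mutual_information(x: np.ndarray, y: np.ndarray, bins: int = 10) -> float:
    """Plug-in estimate of I(X; Y) in nats: bin the continuous feature,
    build the joint histogram with the discrete target, and sum
    p(x,y) * log(p(x,y) / (p(x) * p(y)))."""
    joint, _, _ = np.histogram2d(x, y, bins=(bins, len(np.unique(y))))
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 5000)
informative = y + rng.normal(0, 0.5, 5000)  # shifted by the target
noise = rng.normal(0, 1, 5000)              # unrelated to the target
mi_signal = mutual_information(informative, y)
mi_noise = mutual_information(noise, y)
print(mi_signal > mi_noise)  # True
```

The absolute value depends on the binning, so we would only compare features against each other (or against a pure-noise baseline, as above), never interpret the number on its own.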

In [710]:
chart_utils.visualize_3d_cube(
    diff,
    dimX="ext_source_1",
    dimY="ext_source_2",
    dimZ="ext_source_3",
    target="mean_target",
    title="Spread of TARGET over pairs of EXT_SOURCE_*",
    elevation=30,
    rotation=30,
    palette="PuOr",
)

Validating the contribution of new features¶

In [746]:
@run
@cached_chart()
def eda_feat_importance_including_enriched_extsource():
    df = diff.drop(columns="err")
    num_feats = df.select_dtypes(exclude="category")
    cat_feats = df.select_dtypes(include="category")

    hyperparams = {"subsample": 0.5, "colsample_bytree": 0.5}

    imp = featsel.feature_selection_lgbm(
        df,
        "target",
        list(set(num_feats.columns) - {"target"}),
        list(set(cat_feats.columns)),
        objective="binary",
        hyperparams=hyperparams,
    )

    markers = ["mean_target", "credit_annuity_ratio"]

    p = sns.barplot(
        data=imp.head(25),
        y="feature",
        x="importance",
        color=moonstone,
    )
    for marker in markers:
        row_mean_target = imp[imp["feature"] == marker].index[0]
        x = imp[imp["feature"] == marker]["importance"]
        plt.plot(x + 5, row_mean_target, marker="<", color="orange", markersize=8)

    return p
Loading from cache [./cached/charts/eda_feat_importance_including_enriched_extsource.png]

It seems that the 2 new features we incorporated are actually contributing strongly to our model.

Let's keep them.

Next Steps¶

In this file:

  • ✅ we have prepared our datasets: we have split our data into 4 different datasets for train/hp_tuning/val/test
  • ✅ we have explored our data to get a general understanding of what it looks like
  • ✅ we have found ways to engineer new features that helped our predictive power.
  • ✅ we have used tools from Inferential Statistics and Information Theory to help us understand which features are more likely to contribute to our models

In the next notebook, we will build a few artifacts that will help us automate this data preprocessing (data cleaning, DFS, outlier detection and removal, etc..).

Whenever you're ready, follow me into the next notebook!